Abstract

The concept of the value-gradient is introduced and developed in the context of re- 
inforcement learning, for deterministic episodic control problems that use a function ap- 
proximator and have a continuous state space. It is shown that by learning the value- 
gradients, instead of just the values themselves, exploration or stochastic behaviour is no 
longer needed to find locally optimal trajectories. This is the main motivation for using 
value-gradients, and it is argued that learning the value-gradients is the actual objective 
of any value-function learning algorithm for control problems. It is also argued that learn- 
ing value-gradients is significantly more efficient than learning just the values, and this 
argument is supported in experiments by efficiency gains of several orders of magnitude, in 
several problem domains. 

Once value-gradients are introduced into learning, several analyses become possible. 
For example, a surprising equivalence between a value-gradient learning algorithm and a 
policy-gradient learning algorithm is proven, and this provides a robust convergence proof 
for control problems using a value function with a general function approximator. Also, 
the issue of whether to include 'residual gradient' terms into the weight update equations 
is addressed. Finally, an analysis is made of actor-critic architectures, which finds strong
similarities to back-propagation through time, and gives simplifications and convergence
proofs to certain actor-critic architectures, although at the expense of making those
architectures redundant.

Unfortunately, by proving equivalence to policy-gradient learning, finding new diver- 
gence examples even in the absence of bootstrapping, and proving the redundancy of 
residual-gradients and actor-critic architectures in some circumstances, this paper does 
somewhat discredit the usefulness of using a value function.

Keywords: Reinforcement Learning, Control Problems, Value-gradient, Function ap- 
proximators 



1. Introduction 



Reinforcement learning (RL) algorithms frequently make use of a value function (Sutton and
Barto, 1998). On problem domains where the state space is large and continuous, the value
function needs to be represented by a function approximator. In this paper, analysis is
restricted to episodic control problems of this kind, with a known differentiable
deterministic model.

As Sutton and Barto (1998) stated: "The central role of value estimation is arguably
the most important thing we have learned about reinforcement learning over the last few
decades". However, the use of a value function introduces some major difficulties when
combined with a function approximator, concerning the lack of convergence guarantees to
learning. This problem has led to a major alternative RL approach which works without
a value function at all, i.e. policy-gradient learning (PGL) algorithms (Williams, 1992;
Baxter and Bartlett, 2000; Werbos, 1990), which do have the desired convergence
guarantees. In this paper, a surprising equivalence between these two seemingly different
approaches is shown, and this provides the basis for convergence guarantees to variants of
value function learning algorithms.

It is the central thesis of this paper that for value function methods, it is not the values
themselves that are important, but in fact the value-gradients (defined to be the gradient of
the value function with respect to the state vector). We distinguish methods that aim to
learn a value function by explicitly updating value-gradients from those that do not,
referring to them as value-gradient learning (VGL) and value-learning (VL), respectively.
The necessity of exploration to VL methods is demonstrated in Section 1.3, and becomes
very apparent in our problem domains where all functions are deterministic. We call the
level of exploration that searches immediately neighbouring trajectories local exploration.
This requirement for local exploration is not necessary with VGL methods, since the
value-gradient automatically provides awareness of any superior neighbouring trajectories.
This is shown for a specific example in Section 1.3 and proven in the general case in
Appendix A. It is then argued that VGL methods are an idealised form of VL methods, are
easier to analyse, and are more efficient (Sections 1.4 and 1.5).

The VGL algorithms themselves are stated at the start of Section 2. One of these algorithms
(Section 2.1) is proven to be equivalent to PGL. This is used as the basis for a VGL
algorithm in a continuous-time formulation with convergence guarantees (Section 2.2). It
also produces a tentative theoretical justification for the commonly used TD(λ) weight
update, which from the author's point of view has always been a puzzling issue.

The residual-gradients algorithm for VGL is given and analysed in Section 2.3, and new
reasons are given for the ineffectiveness of residual-gradients in deterministic environments,
both with VGL and VL. In Section 3, actor-critic architectures are defined for VGL, and
it is shown that the value-gradient analysis provides simplifications to certain actor-critic
architectures. This allows new convergence proofs, but at the expense of making the
actor-critic architecture redundant.

In Section 4, experimental details are provided that justify the optimality claims and
the efficiency claims (gains of several orders of magnitude). The problems we use include the
Toy Problem, defined in Section 1.1.1, which is simple enough to analyse in detail, yet
challenging enough to cause difficulties for VL. Examples showing diverging weights are
given for all VL algorithms and some VGL algorithms in Section 4.3. A Lunar-Lander
neural-network experiment is also included, which is a larger-scale problem that seems to
defeat VL. Finally, Section 5 gives conclusions and a discussion on VGL, highlighting the
contributions of this paper.



Value-gradients have already appeared in various forms in the literature. Dayan and Singh
(1996) argue the importance of value-gradients over the values themselves, which is the
central thesis of this paper.

The target value-gradient we define is closely related to the "adjoint vector" that appears
in Pontryagin's maximum principle, as discussed further in Appendix A.

The equations of Back-Propagation Through Time (BPTT, Werbos (1990)) and Differential
Dynamic Programming (Jacobson and Mayne, 1970) implicitly contain references to
the target value-gradient, both with $\lambda = 1$ ($\lambda$ is the bootstrapping parameter,
defined in Section 1.1).

In continuous-time problem domains, Doya (2000) uses the value-gradient explicitly
in the greedy policy, and Baird (1994) defines an "advantage" function that implicitly
references the value-gradient. Both of these are discussed further in Section 2.2. A
value-gradient appears in the Hamilton-Jacobi-Bellman Equation, which is an optimality
condition for continuous-time value functions; however, only its component parallel to the
trajectory is used there, and this component is not useful in obviating the need for local
exploration.

However, the most similar work on value-gradients is in a family of algorithms (Werbos,
1998; White and Sofge, 1992, ch. 13) referred to as Dual Heuristic Programming (DHP).
These are full VGL methods, but are based on actor-critic architectures specifically with
$\lambda = 0$, and are more focussed towards unknown stochastic models.



1.1 Reinforcement Learning Problem Notation and Definitions 

State Space, $S$, is a subset of $\mathbb{R}^n$. Each state in the state space is denoted by
a column vector $x$. A trajectory is a list of states $\{x_0, x_1, \ldots, x_F\}$ through state
space starting at a given point $x_0$. The trajectory is parametrised by real actions $a_t$ for
time steps $t$ according to a model. The model is comprised of two known smooth deterministic
functions $f(x,a)$ and $r(x,a)$. The first model function, $f$, links one state in the
trajectory to the next, given action $a_t$, via the Markovian rule $x_{t+1} = f(x_t, a_t)$. The
second model function, $r$, gives an immediate real-valued reward $r_t = r(x_t, a_t)$ on
arriving at the next state $x_{t+1}$.

Assume that each trajectory is guaranteed to reach a terminal state in some finite time 
(i.e. the problem is episodic). Note that in general, the number of time steps in a trajectory 
may be dependent on the actions taken. For example, a scenario like this could be an aircraft 
with limited fuel trying to land. For a particular trajectory label the final time step t = F, 
so that $x_F$ is the terminal state of that trajectory. Assume each action $a_t$ is a real number
that, for some problems, may be constrained to $-1 \le a_t \le 1$.

For any trajectory starting at state $x_0$ and following actions $\{a_0, a_1, \ldots, a_{F-1}\}$
until reaching a terminal state under the given model, the total reward encountered is given by
the function:

$$\begin{aligned}
R(x_0, a_0, a_1, \ldots, a_{F-1}) &= \sum_{t=0}^{F-1} r(x_t, a_t) \\
&= r(x_0, a_0) + R(f(x_0, a_0), a_1, a_2, \ldots, a_{F-1})
\end{aligned} \tag{1}$$

Thus $R$ is a function of the arbitrary starting state $x_0$ and the actions, and this allows us
to obtain the partial derivative $\frac{\partial R}{\partial x}$.
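As a minimal sketch of Eq. 1, the total reward can be computed either as a running sum or through the recursive form. The model functions `f` and `r` below are hypothetical stand-ins (a simple scalar model chosen for illustration, not one taken from this paper), and the episode is assumed to end when the list of actions is exhausted.

```python
def f(x, a):
    # Hypothetical next-state function: the state moves by the action.
    return x + a


def r(x, a):
    # Hypothetical reward: penalise distance from the origin and action size.
    return -(x * x) - 0.5 * a * a


def total_reward(x0, actions):
    """Sum of r(x_t, a_t) along the trajectory (first line of Eq. 1)."""
    x, total = x0, 0.0
    for a in actions:
        total += r(x, a)
        x = f(x, a)
    return total


def total_reward_recursive(x0, actions):
    """Recursive form (second line of Eq. 1)."""
    if not actions:
        return 0.0
    first, rest = actions[0], actions[1:]
    return r(x0, first) + total_reward_recursive(f(x0, first), rest)
```

Both functions depend only on the start state and the action sequence, which is what makes the partial derivative of $R$ with respect to the start state well defined.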

Policy. A policy is a function $\pi(x, w)$, parametrised by weight vector $w$, that generates
actions as a function of state. Thus for a given trajectory generated by a given policy $\pi$,
$a_t = \pi(x_t, w)$. Since the policy is a pure function of $x$ and $w$, the policy is memoryless.

If a trajectory starts at state $x_0$ and then follows a policy $\pi(x, w)$ until reaching a
terminal state, then the total reward is given by the function:

$$\begin{aligned}
R^\pi(x_0, w) &= \sum_{t=0}^{F-1} r(x_t, \pi(x_t, w)) \\
&= r(x_0, \pi(x_0, w)) + R^\pi(f(x_0, \pi(x_0, w)), w)
\end{aligned}$$


Approximate Value Function. We define $V(x, w)$ to be the real-valued output of a
smooth function approximator with weight vector $w$ and input vector $x$.¹ We refer to
$V(x, w)$ simply as the "value function" over state space $x$, parametrised by weights $w$.

Q Value function. The Q Value function (Watkins and Dayan, 1992) is defined as

$$Q(x, a, w) = r(x, a) + V(f(x, a), w) \tag{2}$$

Trajectory Shorthand Notation. For a given trajectory through states $\{x_0, x_1, \ldots, x_F\}$
with actions $\{a_0, a_1, \ldots, a_{F-1}\}$, and for any function defined on $S$ (e.g. including
$V(x,w)$, $G(x,w)$, $R(x, a_0, a_1, \ldots, a_{F-1})$, $R^\pi(x,w)$, $r(x,a)$, $V'(x,w)$ and
$G'(x,w)$) we use a subscript of $t$ on the function to indicate that the function is being
evaluated at $(x_t, a_t, w)$. For example, $r_t = r(x_t, a_t)$, $G_t = G(x_t, w)$,
$R^\pi_t = R^\pi(x_t, w)$ and $R_t = R(x_t, a_t, a_{t+1}, \ldots, a_{F-1})$.
Note that this shorthand does not mean that these functions are functions of $t$, as that
would break the Markovian condition.

Similarly, for any of these functions' partial derivatives, we use brackets with a
subscripted $t$ to indicate that the partial derivative is to be evaluated at time step $t$. For
example, $\left(\frac{\partial G}{\partial w}\right)_t$ is shorthand for
$\left.\frac{\partial G}{\partial w}\right|_{(x_t, w)}$, i.e. the function
$\frac{\partial G}{\partial w}$ evaluated at $(x_t, w)$. Also, for example,
$\left(\frac{\partial f}{\partial a}\right)_t = \left.\frac{\partial f}{\partial a}\right|_{(x_t, a_t)}$;
$\left(\frac{\partial R}{\partial x}\right)_t = \left.\frac{\partial R}{\partial x}\right|_{(x_t, a_t, a_{t+1}, \ldots, a_{F-1})}$;
and similarly for other partial derivatives including $\left(\frac{\partial r}{\partial x}\right)_t$
and $\left(\frac{\partial f}{\partial x}\right)_t$.

Greedy Policy. The greedy policy on $V$ generates $\pi(x, w)$ such that

$$\pi(x, w) = \arg\max_a Q(x, a, w) \tag{3}$$

subject to the constraints (if present) that $-1 \le a_t \le 1$. The greedy policy is a one-step
look-ahead that decides which action to take, based only on the model and $V$. A greedy
trajectory is one that has been generated by the greedy policy. Since for a greedy policy,
the actions are dependent on the value function and state, and $V = V(x, w)$, it follows that
$\pi = \pi(x, w)$. This means that any modification to the weight vector $w$ will immediately
change $V(x, w)$ and move all greedy trajectories. Hence we say the value function and
greedy policy are tightly coupled.

For the greedy policy, when the constraints $-1 \le a_t \le 1$ are present, we say an action
$a_t$ is saturated if $|a_t| = 1$ and $\left(\frac{\partial Q}{\partial a}\right)_t \ne 0$. If
either of these conditions is not met, or the constraints are not present, then $a_t$ is not
saturated. We note two useful consequences of this:



1. This differs slightly from some definitions in the RL literature, which would refer to this use of the 
function V as an approximated value function for the greedy policy on V . To side-step this circularity in 
definition, we have treated V(x, w) simply as a smooth function on which a greedy policy can be defined. 
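Lemma 1 If $a_t$ is not saturated, then $\left(\frac{\partial Q}{\partial a}\right)_t = 0$ and
$\left(\frac{\partial^2 Q}{\partial a^2}\right)_t \le 0$ (since it's a maximum).

Lemma 2 If $a_t$ is saturated, then, whenever they exist,
$\left(\frac{\partial \pi}{\partial x}\right)_t = 0$ and
$\left(\frac{\partial \pi}{\partial w}\right)_t = 0$.

Note that $\left(\frac{\partial \pi}{\partial x}\right)_t$ and
$\left(\frac{\partial \pi}{\partial w}\right)_t$ may not exist, for example, if there are
multiple joint maxima in $Q(x, a, w)$ with respect to $a$.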

ε-Greedy Policy. In our later experiments we implement VL algorithms that require some
exploration. Hence we make use of a slightly modified version of the greedy policy which
we refer to as the ε-greedy policy:²

$$\pi(x, w) = \arg\max_a Q(x, a, w) + \mathrm{RND}(\epsilon)$$

Here $\mathrm{RND}(\epsilon)$ is defined to be a random number generator that returns a normally
distributed random variable with mean 0 and standard deviation $\epsilon$.
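A minimal sketch of this perturbed policy is given below. It assumes a scalar, unconstrained action, and it takes the greedy action from a caller-supplied function `greedy_action` (a hypothetical name; how the argmax over $Q$ is computed is left open here).

```python
import random


def epsilon_greedy_action(greedy_action, x, eps, rng=random):
    """Greedy action plus zero-mean Gaussian noise with standard deviation eps.

    greedy_action: callable returning argmax_a Q(x, a, w) for state x
                   (assumed to be provided by the caller).
    rng:           any object with a gauss(mu, sigma) method, e.g. random.Random(seed).
    """
    return greedy_action(x) + rng.gauss(0.0, eps)
```

With `eps = 0` the noise term is identically zero, so the function reduces to the plain greedy policy.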

The Value-Gradient - definition. The value-gradient function $G(x, w)$ is the derivative
of the value function $V(x, w)$ with respect to state vector $x$. Therefore
$G(x, w) = \frac{\partial V(x, w)}{\partial x}$.
Since $V(x, w)$ is defined to be smooth, the value-gradient always exists.
Targets for V. For a trajectory found by a greedy policy $\pi(x, w)$ on a value function
$V(x, w)$, we define the function $V'(x, w)$ recursively as

$$V'(x, w) = r(x, \pi(x, w)) + \lambda V'(f(x, \pi(x, w)), w) + (1 - \lambda) V(f(x, \pi(x, w)), w) \tag{4}$$

with $V'(x_F, w) = 0$ and where $0 \le \lambda \le 1$ is a fixed constant. To calculate $V'$ for a
particular point $x_0$ in state space, it is necessary to run and cache a whole trajectory
starting from $x_0$ under the greedy policy $\pi(x, w)$, and then work backwards along it applying
the above recursion; thus $V'(x, w)$ is defined for all points in state space.
Using shorthand notation, the above equation simplifies to

$$V'_t = r_t + \lambda V'_{t+1} + (1 - \lambda) V_{t+1}$$

$\lambda$ is a "bootstrapping" parameter, giving full bootstrapping when $\lambda = 0$ and none
when $\lambda = 1$, described by Sutton (1988). When $\lambda = 1$, $V'(x, w)$ becomes identical
to $R^\pi(x, w)$. For any $\lambda$, $V'$ is identical to the "λ-return", or the "forward view of
TD(λ)", described by Sutton and Barto (1998). The use of $V'$ greatly simplifies the analysis of
value functions and value-gradients.
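The backward recursion for the targets can be sketched as follows. This is an illustrative helper (the name `lambda_targets` and the list-based interface are assumptions), taking the cached rewards $r_0, \ldots, r_{F-1}$ and the approximator outputs $V_0, \ldots, V_F$ along one trajectory.

```python
def lambda_targets(rewards, values, lam):
    """Return [V'_0, ..., V'_F] for one cached trajectory.

    rewards: [r_0, ..., r_{F-1}]
    values:  [V_0, ..., V_F], the value function evaluated at each visited state
    lam:     bootstrapping parameter, 0 <= lam <= 1

    Works backwards from V'_F = 0 using
    V'_t = r_t + lam * V'_{t+1} + (1 - lam) * V_{t+1}.
    """
    F = len(rewards)
    if len(values) != F + 1:
        raise ValueError("values must have one more entry than rewards")
    targets = [0.0] * (F + 1)  # targets[F] stays 0 (terminal state)
    for t in range(F - 1, -1, -1):
        targets[t] = rewards[t] + lam * targets[t + 1] + (1.0 - lam) * values[t + 1]
    return targets
```

With `lam = 1.0` the values are ignored and each target is the sum of the remaining rewards; with `lam = 0.0` each target is the one-step bootstrapped value.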

We refer to the values $V'_t$ as the "targets" for $V_t$. The objective of any VL algorithm
is to achieve $V_t = V'_t$ for all $t > 0$ along all possible greedy trajectories. By Eq. 4 and
for any $\lambda$, this objective becomes equivalent to the deterministic and undiscounted case
of the Bellman Equation of dynamic programming (Sutton and Barto, 1998):

$$V_t = V'_t \;\; \forall t > 0 \iff V_t = r_t + V_{t+1} \;\; \forall t > 0 \tag{5}$$

Since the Bellman Equation needs satisfying at all points in state space, it is sometimes
referred to as a global method, and this means VL algorithms always need to incorporate
some form of exploration.



2. This differs from the definition of the ε-greedy policy that Sutton and Barto (1998) use. Also in the
definition we use, we have assumed the actions are unconstrained.
We point out that since $V'$ is dependent on the actions and on $V(x, w)$, it is not a simple
matter to attain the objective $V = V'$, since changing $V$ infinitesimally will immediately
move the greedy trajectories (since they are tightly coupled), and therefore change $V'$; these
targets are moving ones. However, a learning algorithm that continually attempts to move
the values $V_t$ infinitesimally and directly towards the values $V'_t$ is equivalent to TD(λ)
(Sutton, 1988), as shown in Section 1.2.1. This further justifies Eq. 4.



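Matrix-vector notation. Throughout this paper, a convention is used that all defined
vector quantities are columns, and any vector becomes transposed (becoming a row) if it
appears in the numerator of a differential. Upper indices indicate the component of a vector.
For example, $x_t$ is a column; $w$ is a column; $G_t$ is a column;
$\left(\frac{\partial R^\pi}{\partial w}\right)_t$ is a column;
$\left(\frac{\partial f}{\partial a}\right)_t$ is a row;
$\left(\frac{\partial f}{\partial x}\right)_t$ is a matrix with element $(i,j)$ equal to
$\left(\frac{\partial f(x,a)^j}{\partial x^i}\right)_t$;
$\left(\frac{\partial G}{\partial w}\right)_t$ is a matrix with element $(i,j)$ equal to
$\left(\frac{\partial G^j}{\partial w^i}\right)_t$. An example of a product is
$\left(\frac{\partial f}{\partial a}\right)_t G_{t+1} = \sum_i \left(\frac{\partial f^i}{\partial a}\right)_t G^i_{t+1}$.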

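Target Value-Gradient, ($G'_t$). We define $G'(x, w) = \frac{\partial V'(x, w)}{\partial x}$.
Expanding this, using Eq. 4, gives:

$$G'_t = \left( \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial r}{\partial a}\right)_t \right)
+ \left( \left(\frac{\partial f}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial f}{\partial a}\right)_t \right)
\left( \lambda G'_{t+1} + (1 - \lambda) G_{t+1} \right) \tag{6}$$

with $G'_F = 0$. To obtain this total derivative we have used the fact that $a_t = \pi(x_t, w)$, and
that therefore changing $x_t$ will immediately change all later actions and states.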

This recursive formula takes a known target value-gradient at the end point of a trajectory
($G'_F = 0$), and works it backwards along the trajectory, rotating and incrementing it as
appropriate, to give the desired value-gradient at each time step. This is the central
equation behind all VGL algorithms; the objective for any VGL algorithm is to attain
$G_t = G'_t$ for all $t > 0$ along a greedy trajectory. As with the target values, it should be
noted that this objective is not straightforward to achieve since the values $G'_t$ are moving
targets and are highly dependent on $w$.
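To make the backward pass concrete, here is a minimal sketch for the special case of a one-dimensional state and a scalar action, so that every matrix and vector in the recursion reduces to a scalar. The function name and the list-based interface are assumptions for illustration; the per-step derivatives are taken as given, computed elsewhere from the model and the policy.

```python
def target_value_gradients(dr_dx, dr_da, df_dx, df_da, dpi_dx, G, lam):
    """Backward recursion for the target value-gradients G'_0..G'_F (scalar case).

    dr_dx, dr_da, df_dx, df_da, dpi_dx: lists indexed by t = 0..F-1 holding the
        model and policy derivatives evaluated along the trajectory.
    G:   list indexed by t = 0..F holding the approximator's value-gradients.
    lam: bootstrapping parameter, 0 <= lam <= 1.
    """
    F = len(dr_dx)
    Gp = [0.0] * (F + 1)  # G'_F = 0 at the terminal state
    for t in range(F - 1, -1, -1):
        # Mixture of the next target gradient and the next learned gradient.
        mix = lam * Gp[t + 1] + (1.0 - lam) * G[t + 1]
        total_dr = dr_dx[t] + dpi_dx[t] * dr_da[t]  # total derivative of r w.r.t. x_t
        total_df = df_dx[t] + dpi_dx[t] * df_da[t]  # total derivative of f w.r.t. x_t
        Gp[t] = total_dr + total_df * mix
    return Gp
```

As a check under these assumptions: for a two-step episode with $f = x + a$, $r = -a^2$ at the first step, $f = x$, $r = -x^2$ at the second, and the fixed linear policy $\pi(x) = -x/2$, the total reward is $-x_0^2/2$, and with `lam = 1.0` the recursion returns $-x_0$ at $t = 0$, matching the derivative of the total reward.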

The above objective $G = G'$ is a local requirement that only needs satisfying along a
greedy trajectory, and is usually sufficient to ensure the trajectory is locally optimal. This
is in stark contrast to the Bellman Equation for VL (Eq. 5), which is a global requirement.
Consequently VGL is potentially much more efficient and effective than VL. This difference
is justified and explained further in Section 1.3.

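All terms of Eq. 6 are obtainable from knowledge of the model functions and the policy.
For obtaining the term $\frac{\partial \pi}{\partial x}$ it is usually preferable to have the
greedy policy written in analytical form, as done in Section 2.2 and the experiments of
Section 4. Alternatively, using a derivation similar to that of Eq. 17, it can be shown that,
when it exists,

$$\left(\frac{\partial \pi}{\partial x}\right)_t =
\begin{cases}
-\left(\frac{\partial^2 Q}{\partial x \, \partial a}\right)_t \left(\frac{\partial^2 Q}{\partial a^2}\right)_t^{-1} & \text{if } a_t \text{ unsaturated and } \left(\frac{\partial^2 Q}{\partial a^2}\right)_t^{-1} \text{ exists} \\
0 & \text{if } a_t \text{ saturated}
\end{cases}$$

If $\lambda > 0$ and if $\frac{\partial \pi}{\partial x}$ does not exist at some time step, $t_0$,
of the trajectory, then $G'_t$ is not defined for all $t \le t_0$. In some common situations, such
as the continuous-time formulations (Section 2.2), $\frac{\partial \pi}{\partial x}$ is always
defined so this is not a problem.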
Figure 1: Illustration of 2-step Toy Problem

In the special case where $\lambda = 0$, the dependency of Eq. 6 on $\frac{\partial \pi}{\partial x}$
disappears and so $G'_t$ always exists, and the definition reduces to
$G'_t = \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1}$.
In this case $G'_t$ is equivalent to the target value-gradient that Werbos uses in the
algorithms DHP and GDHP (Werbos, 1998, Eq. 18).
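One way to see why the policy-derivative terms drop out in this case is to collect them in Eq. 6 and recognise the derivative of Eq. 2 with respect to the action. The following is a sketch of that step for a greedy policy, written under the assumption that the derivatives involved exist:

```latex
\begin{align*}
G'_t &= \left(\frac{\partial r}{\partial x}\right)_t
      + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1}
      + \left(\frac{\partial \pi}{\partial x}\right)_t
        \left( \left(\frac{\partial r}{\partial a}\right)_t
             + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} \right) \\
     &= \left(\frac{\partial r}{\partial x}\right)_t
      + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1}
      + \left(\frac{\partial \pi}{\partial x}\right)_t
        \left(\frac{\partial Q}{\partial a}\right)_t
\end{align*}
```

The final product vanishes in both cases: the derivative of $Q$ with respect to the action is zero when $a_t$ is not saturated (Lemma 1), and the derivative of the policy with respect to the state is zero when $a_t$ is saturated (Lemma 2).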



When $\lambda = 1$, $G'_t$ becomes identical to $\left(\frac{\partial R^\pi}{\partial x}\right)_t$.
This quantity appears explicitly in the equations of differential dynamic programming
(Jacobson and Mayne, 1970), and implicitly in back-propagation through time (Eq. 15).

1.1.1 Example - Toy Problem 

Many experiments in this paper make use of the n-step Toy Problem with parameter $k$.
This is a problem in which an agent can move along a straight line and must move towards
the origin efficiently in a given number of time steps, illustrated in Fig. 1 and defined
generically now:

State space is one-dimensional and continuous. The actions are unbounded. In this 
episodic problem, we define the model functions differently at each time step, and each 
trajectory is defined to terminate at time step t = n + 1. Strictly speaking, to satisfy the 
Markovian requirement and achieve these time step dependencies, we should add one extra 
dimension to state space to hold t and adjust the model functions accordingly. However 
this complication was omitted in the interests of keeping the notation simple. Under this 
simplification, the model functions are: 

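$$f(x_t, a_t) = \begin{cases} x_t + a_t & \text{if } 0 \le t < n \\ x_t & \text{if } t = n \end{cases} \tag{7a}$$

$$r(x_t, a_t) = \begin{cases} -k a_t^2 & \text{if } 0 \le t < n \\ -x_t^2 & \text{if } t = n \end{cases} \tag{7b}$$
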

where $k$ is a real-valued non-negative constant to allow more varieties of problem types to
be specified. The $(n+1)$th time step is present just to deliver a final reward based only on
state. The model functions in the time step $t = n$ are independent of the action $a_n$, and
so each trajectory is completely parametrised by just $(x_0, a_0, a_1, \ldots, a_{n-1})$. This
completes the definition of the Toy Problem.
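The Toy Problem model can be sketched in a few lines. This sketch assumes the model as implied by the total-reward expression in this section (the state moves by the action and each action costs $k a_t^2$ for $t < n$; the final step leaves the state unchanged and pays $-x_n^2$), and it passes the time step `t` explicitly rather than folding it into the state. The function names are illustrative.

```python
def toy_f(x, a, t, n):
    """Next state: move by the action, except at the final step t = n."""
    return x + a if t < n else x


def toy_r(x, a, t, n, k):
    """Reward: action cost for t < n, final state cost at t = n."""
    return -k * a * a if t < n else -x * x


def toy_rollout(x0, actions, k):
    """Run the n-step Toy Problem with n = len(actions).

    Returns (states x_0..x_{n+1}, total reward).
    """
    n = len(actions)
    xs, total, x = [x0], 0.0, x0
    for t in range(n + 1):
        a = actions[t] if t < n else 0.0  # a_n has no effect on the model
        total += toy_r(x, a, t, n, k)
        x = toy_f(x, a, t, n)
        xs.append(x)
    return xs, total
```

For example, `toy_rollout(3.0, [-1.0, -1.0], k=0.5)` visits the states 3, 2, 1, 1 and collects an action cost of 0.5 at each of the first two steps followed by a final reward of -1.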

Next we describe the optimal trajectories and optimal policy for the n-step Toy Problem.
Since the total reward is

$$\begin{aligned}
R(x_0, a_0, a_1, \ldots, a_{n-1}) &= -k\left(a_0^2 + a_1^2 + \ldots + a_{n-1}^2\right) - (x_n)^2 \\
&= -k\left(a_0^2 + a_1^2 + \ldots + a_{n-1}^2\right) - \left(x_0 + a_0 + a_1 + \ldots + a_{n-1}\right)^2
\end{aligned}$$




Figure 2: The lines A to F are optimal trajectories for the Toy Problem. The trajectories 
are shown as continuous lines for illustration here, although time is defined to be 
discrete in this problem. The arrowheads of the trajectories show the position of 
the terminal states. 



then the actions that maximise this are all equal, and are given by

$$a_t = \frac{-x_0}{n + k} \quad \text{for all } 0 \le t < n.$$

Since the optimal actions are all equal and directed oppositely to the initial state $x_0$,
optimal trajectories form straight lines towards the centre as shown in Figure 2. Each
optimal trajectory terminates at $x_n = \frac{k x_0}{n + k}$. The optimal policy, usually
denoted $\pi^*(x_t)$, is

$$\pi^*(x_t) = \frac{-x_t}{n - t + k} \quad \text{for all } 0 \le t < n \tag{8}$$


The optimal value function, usually denoted $V^*(x_t)$, can be found for this problem by
evaluating the total reward encountered on following the optimal policy until termination:

$$V^*(x_t) = \begin{cases} -\dfrac{k (x_t)^2}{n - t + k} & \text{if } t \le n \\ 0 & \text{if } t = n + 1 \end{cases} \tag{9}$$

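The closed forms for the optimal policy and value function can be checked numerically. The sketch below assumes $k > 0$ and the Toy Problem model in which the state moves by the action and each action costs $k a_t^2$ for $t < n$, with a final reward of $-x_n^2$; the function names are illustrative.

```python
def optimal_action(x, t, n, k):
    """Optimal policy (valid for 0 <= t < n)."""
    return -x / (n - t + k)


def optimal_value(x, t, n, k):
    """Optimal value function; assumes k > 0 so the denominator is non-zero at t = n."""
    return -k * x * x / (n - t + k) if t <= n else 0.0


def follow_optimal_policy(x0, n, k):
    """Total reward from x0 at t = 0 under the optimal policy, and the final state x_n."""
    x, total = x0, 0.0
    for t in range(n):
        a = optimal_action(x, t, n, k)
        total += -k * a * a  # reward for t < n
        x = x + a            # state transition for t < n
    total += -x * x          # final reward at t = n
    return total, x
```

Following the policy from $x_0 = 3$ with $n = 2$ and $k = 1$ applies the action $-1$ twice, ends at $x_n = 1 = k x_0 / (n + k)$, and accumulates a total reward equal to `optimal_value(3.0, 0, 2, 1.0)`.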

For a simple example of a value function and greedy policy, the reader should see
Appendix B (Equations 33 and 34).

1.2 Value-Learning Methods 

Introducing the targets $V'$ simplifies the formulation of some learning algorithms. The
objective of all VL algorithms is to learn $V_t = V'_t$ for all $t \ge 1$. We do not need to
consider the time step $t = 0$ since the greedy policy is independent of the value function at
$t = 0$. A quick review of two common VL methods follows.

All learning algorithms in this paper are "off-line", that is, any weight updates are applied
only after considering a whole trajectory. In all learning algorithms we take $\alpha$ to be a
small positive constant.
1.2.1 TD(λ) Learning

The TD(λ) algorithm (Sutton, 1988) attempts to achieve $V_t = V'_t$ for all $t \ge 1$ by the
following weight update:

$$\Delta w = \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial w}\right)_t \left(V'_t - V_t\right) \tag{10}$$




The equivalence of this formulation to that used by Sutton (1988) is proven in Appendix
C. This equivalence validates the $V'$ notation. Although not originally defined for control
problems by Sutton (1988), it is shown in Appendix D that the TD(λ) weight update can
be applied directly to control problems with a known model using the ε-greedy policy, and
is then equivalent to Sarsa(λ) (Rummery and Niranjan, 1994).

Unfortunately there are no convergence guarantees for this equation when used with a
general function approximator for the value function, and divergence examples abound.

As shown by Sutton and Barto (1998), TD learning becomes Monte-Carlo learning when
the bootstrapping parameter $\lambda = 1$.
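The off-line update can be sketched for one cached trajectory as below. This sketch assumes a linear function approximator $V(x, w) = w \cdot \phi(x)$, so that the derivative of $V$ with respect to $w$ at time step $t$ is just the feature vector $\phi(x_t)$; the feature representation and the function name are assumptions for illustration, and the targets are treated as fixed numbers while the update is formed.

```python
def td_lambda_update(features, values, targets, alpha):
    """Weight change for one trajectory, summing over t >= 1.

    features: [phi(x_0), ..., phi(x_F)], each a list of floats (dV/dw for linear V)
    values:   [V_0, ..., V_F]
    targets:  [V'_0, ..., V'_F]
    alpha:    small positive learning rate
    """
    dim = len(features[0])
    delta_w = [0.0] * dim
    for t in range(1, len(features)):  # t = 0 is skipped, as in the text
        err = targets[t] - values[t]
        for i in range(dim):
            delta_w[i] += alpha * features[t][i] * err
    return delta_w
```

The returned vector is added to $w$ only after the whole trajectory has been processed, in keeping with the off-line convention used throughout.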



1.2.2 Residual Gradients 



With the aim of improving the convergence guarantees of the previous method, the approach
of Baird (1995) is to minimise the error $E = \frac{1}{2} \sum_{t \ge 1} (V'_t - V_t)^2$ by
gradient descent on the weights:

$$\Delta w = \alpha \sum_{t \ge 1} \left( \left(\frac{\partial V}{\partial w}\right)_t - \left(\frac{\partial V'}{\partial w}\right)_t \right) \left(V'_t - V_t\right)$$

The extra terms introduced by this method are referred to throughout this paper as
the "residual gradient terms". We extend the residual gradient method to value-gradients
in Section 2.3, and extend it to work with a greedy policy that is tightly coupled with the
value function.
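For reference, the residual-gradient update follows from differentiating the sum-of-squares error directly. The intermediate step, sketched here with the targets treated as functions of $w$, is:

```latex
\begin{align*}
\Delta w = -\alpha \frac{\partial E}{\partial w}
  &= -\alpha \sum_{t \ge 1}
     \left( \left(\frac{\partial V'}{\partial w}\right)_t
          - \left(\frac{\partial V}{\partial w}\right)_t \right)
     \left( V'_t - V_t \right) \\
  &= \alpha \sum_{t \ge 1}
     \left( \left(\frac{\partial V}{\partial w}\right)_t
          - \left(\frac{\partial V'}{\partial w}\right)_t \right)
     \left( V'_t - V_t \right)
\end{align*}
```

Dropping the terms that involve the derivative of the targets $V'_t$ with respect to $w$ recovers the TD(λ) update of Section 1.2.1, so those are the extra terms that the residual-gradient method introduces.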



1.3 Motivation for Value-Gradients

The above VL algorithms need to use some form of exploration. The required exploration 
could be implemented by randomly varying the start point in state space at each iteration, 
a technique known as "exploring starts" . Alternatively a stochastic model or policy could 
be used to force exploration within a single trajectory. Exploration introduces inefficiencies 
which are discussed in Section 1.5.

Figure 3 demonstrates why VL without exploration can lead to suboptimal trajectories,
whereas VGL will not. Understanding this is central to this paper, as it is the main
motivation for using value-gradients. If a fuller explanation of either of these cases
is required, then Appendix B gives details, and goes further in showing that the trajectory
with learned value-gradients will also be optimal.

The conclusion of this example, and of Appendix B, is that without exploration, VL can,
and in fact is likely to, converge to a suboptimal trajectory. Exploration is
necessary since it enables the learning algorithm to become aware of the presence of any
superior neighbouring trajectories (which is exactly the information that a value-gradient
contains). Without this awareness learning can terminate on a suboptimal trajectory, as
this counterexample shows.
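The counterexample can be reproduced numerically on the Toy Problem. The sketch below rests on several assumptions that are stated here rather than taken from the text: the Toy Problem model has the state move by the action with action cost $k a_t^2$ for $t < n$ and final reward $-x_n^2$; the value at the terminal state is taken as 0; and the greedy argmax is approximated by a search over a small list of candidate actions. All function names are illustrative.

```python
def greedy_action(x, k, V, candidates):
    """One-step look-ahead for t < n: argmax over candidates of -k a^2 + V(x + a)."""
    return max(candidates, key=lambda a: -k * a * a + V(x + a))


def greedy_rollout_with_targets(x0, n, k, V, lam, candidates):
    """Greedy trajectory x_0..x_n, its total reward, and the targets V'_0..V'_n."""
    xs, rewards = [x0], []
    for t in range(n):
        a = greedy_action(xs[-1], k, V, candidates)
        rewards.append(-k * a * a)
        xs.append(xs[-1] + a)
    rewards.append(-xs[-1] ** 2)  # final reward at t = n
    targets = [0.0] * (n + 2)     # targets[n + 1] = 0 at the terminal state
    for t in range(n, -1, -1):
        v_next = V(xs[t + 1]) if t < n else 0.0  # terminal value assumed to be 0
        targets[t] = rewards[t] + lam * targets[t + 1] + (1.0 - lam) * v_next
    return xs, sum(rewards), targets[: n + 1]
```

With the constant value function `V = lambda x: -4.0`, start state `x0 = 2.0`, `n = 2` and `k = 1.0`, the greedy action is zero at every step, every target equals $-4$, and so the value-learning objective is already met along the trajectory. The total reward of $-4$ is nevertheless worse than the optimum $-k x_0^2 / (n + k) = -4/3$.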


Figure 3: Illustration of the problem of VL without exploration. In this diagram the floating
negative numbers in the bottom left quadrant represent value function spot values
(for a value function which is constant throughout state space). Hence, in the
Toy Problem the greedy policy will choose zero actions, and the greedy trajectory
(from A) is the straight line shown. The negative numbers above the x axis give
the final reward, $r_n$. Since the intermediate rewards $r_t$ are zero for all $t < n$,
we have $V'_t = -8$ for all $t < n$. So $V_t = V'_t = -8$ for all $t < n$, and so VL is
complete; yet the trajectory is not optimal (cf. Fig. 2). This situation cannot
happen with VGL, since if the value-gradients were learned then there would be
a value-gradient perpendicular to the trajectory, and the greedy policy would not
have chosen the trajectory shown.



Note that this requirement for exploration is separate from the issue of exploration that 
is sometimes also required to learn an unknown model. 

Similar counterexamples can be constructed for other problem domains. For example,
Figure 3 can be applied to any problem where all the reward occurs at the end of an episode.

Furthermore, it is proven in Appendix A that in a general problem, learning the
value-gradients along a greedy trajectory is a sufficient condition for the trajectory to be
locally extremal, and also often ensures the trajectory is locally optimal (these terms are
defined in the same appendix). This contrasts with VL in that there has been no need for
exploration. Local exploration comes for free with value-gradients, since knowledge of the
value-gradients automatically provides knowledge of the neighbouring trajectories.

1.4 Relationship of Value-Learning to Value-Gradient Learning

Learning the value-gradients along a trajectory learns the relative values of the value func- 
tion along it and all of its immediately neighbouring trajectories. We refer to all these 
trajectories collectively as a tube. Any VL algorithm that exhibits sufficient exploration to 
learn the target values fully throughout the entire tube would also achieve this goal, and 
therefore also achieve locally optimal trajectories. 

This shows the consistency of the two methods, and the equivalence of their objectives. 
We believe VGL techniques represent an idealised form of VL techniques, and that VL 
is a stochastic approximation to VGL. Both have similar objectives, i.e. to achieve an 
optimal trajectory by learning the relative target values throughout a tube; but whereas 
VL techniques rely on the scattergun approach of stochastic exploration to achieve this,
VGL techniques go about it more methodically and directly.

Figure 4 illustrates the contrasting approaches of the two methods.

Figure 4: Diagrams contrasting VL by stochastic control (left figure) against deterministic
learning by VGL (right figure). In the VL case (left figure), stochastic control
makes the trajectory zigzag, and the target value passed backwards along the
trajectory (which in this case is approximately -7, whereas the deterministic
value should be -8) will be passed back to the points in state space encountered
along the trajectory. In the VGL case (right figure), the value-gradient at the end
point (which will be a small vector pointing in the positive x direction) is passed
backwards, without any stochastic distortion, along the central trajectory and
therefore influences the value function along all three trajectories simultaneously.



Once the effects of exploration are averaged out in VL, the underlying weight update
is often very similar to a VGL weight update, but possibly with some extra, and usually
unwanted, terms. See Section 4.1 for an example analysis in a specific problem. Hence, we
would expect a large proportion of results obtained for one method to apply to the other.
For example, using a value-gradient analysis, in Section 4.3 we derive an example that
causes VGL to diverge, and then empirically find that this example causes divergence in VL
too. Also, we would expect the analyses on residual gradients and actor-critic architectures
to place limitations on what can be achieved with these methods when used with VL
(Sections 2.3 and 3).

If the value-gradient is learned throughout the whole of state space, i.e. if
$\frac{\partial V}{\partial x} = \frac{\partial V'}{\partial x}$ for all $x$, then this forms a
differential equation of which the Bellman Equation ($V = V'$ for all $x$; see Eq. 5) is a
particular solution. This is illustrated in Equation 11:

$$G(x, w) = G'(x, w) \;\; \forall x \iff V(x, w) = V'(x, w) + c \;\; \forall x \tag{11}$$

Of course the arbitrary constant, $c$, is not important as it does not affect the greedy
policy. Hence we propose that the values themselves are not important at all; it is only
the value-gradients that are important. This extreme view is consistent with the fact that
it is value-gradients, not values, that appear in the optimality proof (Appendix A), in the
relationship of value function learning to PGL (Section 2.1) and in the equation for
$\frac{\partial \pi}{\partial w}$ (Eq. 17).
The directness of the VGL method means it is more open to theoretical analysis than the
VL method. The greedy policy is dependent on the value-gradient but not on the values (see
Eq. 17 or Eq. 34), so it seems essential to consider value-gradients if we are to understand
how an update to $w$ will affect a greedy trajectory. If we define a "whole system" to mean
the conjunction of a tightly coupled value function with a greedy policy, then it is necessary
to consider value-gradients if we are to understand the overall convergence properties of any
"whole system" weight update.

1.5 Efficiencies of learning Value-Gradients 

VGL introduces some significant efficiency improvements into value function learning. 



• The removal of the need for exploration should be a major efficiency gain. As learning
algorithms are already iterative, the need to explore neighbouring trajectories causes a
nested layer of iteration. Also, the issue of exploration severely restricts standard
algorithms, particularly those that work by learning the $Q(x, a, w)$ function (e.g. Sarsa($\lambda$),
Q($\lambda$)-learning (Watkins, 1989)). For example, whenever an action $a_t$ in a trajectory
is exploratory, i.e. non-greedy, the target values backed up to all previous time steps
(i.e. the values $V'_k$ for all $k < t$) will be changed. This effectively sends the wrong
learning targets back to the early time steps. Sarsa($\lambda$) deals with this difficulty
by making $\epsilon$ slowly tend to zero in the $\epsilon$-greedy policy. Q($\lambda$)-learning deals with it
by forcing $\lambda = 0$ or truncating learning beyond the exploratory time
step. Both of these solutions have performance implications (we show in Section 4.1
that as $\epsilon \to 0$, learning grinds to a halt).



• Learning the value-gradients along a trajectory is similar to learning the value
function's relative values along an entire group of immediately neighbouring trajectories
(see Fig. 4). Thus the value-gradients encapsulate more relevant information than
the values, and therefore learning by value-gradients should be faster (provided the
function approximator for $V$ can be made to learn gradients efficiently, which is not
a trivial problem).

• As described in Section 2.2, some VGL algorithms are doing true gradient ascent on
$R^\pi$, so they can be sped up through any of the fast neural-network optimisation
algorithms available.



These informal arguments are backed up by the experiments in Section 4.



2. Learning Algorithms for Value-Gradients 

The objective of any VGL algorithm is to ensure $G_t = G'_t$ for all $t > 0$ along a greedy
trajectory. As proven in Appendix A, this will be sufficient to ensure a locally extremal
trajectory (and often a locally optimal trajectory). This section looks at some learning
algorithms that try to achieve this objective. In Section 2.2 we derive a VGL algorithm
that is guaranteed to converge to a locally optimal trajectory.

Define the sum-of-squares error function for value-gradients as

$$E(x_0, w) = \frac{1}{2} \sum_{t \ge 1} (G_t - G'_t)^T \Omega_t (G_t - G'_t) \qquad (12)$$

for a given greedy trajectory. $\Omega_t$ is any arbitrarily chosen positive semi-definite matrix (as
introduced by Werbos, 1998), included for generality, and often just taken to be the identity
matrix for all $t$. A use for $\Omega_t$ could be to allow us to compensate explicitly for any rescaling
of the dimensions of state-space. If bootstrapping is used then $\Omega_t$ should be chosen to be
strictly positive definite.

One approach for VGL is to perform gradient descent on the above error function (giving 
the VGL counterpart to residual-gradients): 



$$\Delta w = -\alpha \frac{\partial E}{\partial w} \qquad (13)$$



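This equation is analysed in Section 2.3. However a simpler weight update is to omit the
"residual gradient" terms (giving the VGL counterpart to TD($\lambda$)):

$$\Delta w = \alpha \sum_{t \ge 1} \left(\frac{\partial G}{\partial w}\right)_t \Omega_t (G'_t - G_t) \qquad (14)$$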



In the next section we prove that this equation, with $\lambda = 1$, is equivalent to PGL, and
show it leads to a successful algorithm with convergence guarantees.
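
As a rough illustration of how the error function of Eq. 12 and the weight update of Eq. 14 might be evaluated once a trajectory has been unrolled, consider the NumPy sketch below. The function name, the array layout (in particular storing $(\partial G/\partial w)_t$ with entry `[t, i, j]` equal to $\partial G_j/\partial w_i$), and the assumption that the targets $G'_t$ have already been computed are choices made for this example only.

```python
import numpy as np

def vgl_error_and_update(dG_dw, G, G_target, Omega, alpha):
    """Evaluate E (Eq. 12) and delta_w (Eq. 14) along one stored trajectory.

    Assumed array layout (hypothetical, for illustration):
      dG_dw    : (T, n_w, n_x), entry [t, i, j] = dG_j / dw_i at step t
      G        : (T, n_x) value-gradients from the function approximator
      G_target : (T, n_x) target value-gradients G'
      Omega    : (T, n_x, n_x) positive semi-definite matrices
    """
    diff = G_target - G
    # E = 1/2 * sum_t (G_t - G'_t)^T Omega_t (G_t - G'_t)
    error = 0.5 * np.einsum('ti,tij,tj->', diff, Omega, diff)
    # delta_w = alpha * sum_t (dG/dw)_t Omega_t (G'_t - G_t)
    delta_w = alpha * np.einsum('twi,tij,tj->w', dG_dw, Omega, diff)
    return error, delta_w

# Small usage example with identity Omega matrices.
T, n_w, n_x = 2, 3, 2
dG_dw = np.ones((T, n_w, n_x))
G = np.zeros((T, n_x))
G_target = np.array([[1.0, 2.0], [3.0, 4.0]])
Omega = np.stack([np.eye(n_x)] * T)
error, delta_w = vgl_error_and_update(dG_dw, G, G_target, Omega, alpha=0.1)
print(error, delta_w)
```

Note that the update is not the gradient of `error`: it treats the targets as fixed, which is exactly the omission of the "residual gradient" terms described above.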

Any VGL algorithm is going to involve using the matrices $\partial G/\partial w$ and/or $\partial G/\partial x$ which, for
neural-networks, involves second order back-propagation. This is described by (White and Sofge,
1991, ch. 10) and (Coulom, 2002, Appendix A). In fact, these matrices are only required
when multiplied by a column vector, which can be implemented efficiently by extending the
techniques of Pearlmutter (1994) to this situation.



2.1 Relationship to Policy-Gradient Learning

We now prove that the VGL update of Eq. 14, with $\lambda = 1$ and a carefully chosen $\Omega_t$ matrix,
is equivalent to PGL on a greedy policy. It is this equivalence that provides convergence
guarantees for Eq. 14. To make this demonstration clearest, it is easiest to start by
considering a PGL weight update; although it should be pointed out that the discovery of
this equivalence occurred the opposite way around, since forms of Eq. 14 date back prior
to Werbos (1998).

PGL, sometimes also known as "direct" reinforcement learning, is defined to be gradient
ascent on $R^\pi(x_0, w)$ with respect to the weight vector $w$ of the policy:

$$\Delta w = \alpha \frac{\partial R^\pi}{\partial w}$$



Back-propagation through time (BPTT) is merely an efficient implementation of this
formula, designed for architectures where the policy $\pi(x, w)$ is provided by a neural-network
(see Werbos, 1990). PGL methods will naturally find stationary points that are constrained
locally optimal trajectories (see Appendix A for optimality definitions).

For PGL, we therefore have:

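$$\left(\frac{\partial R^\pi}{\partial w}\right)_t = \frac{\partial \left( r(x_t, \pi(x_t, w)) + R^\pi(f(x_t, \pi(x_t, w)), w) \right)}{\partial w}$$

$$= \left(\frac{\partial \pi}{\partial w}\right)_t \left( \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t \left(\frac{\partial R^\pi}{\partial x}\right)_{t+1} \right) + \left(\frac{\partial R^\pi}{\partial w}\right)_{t+1}$$

$$= \sum_{k \ge t} \left(\frac{\partial \pi}{\partial w}\right)_k \left( \left(\frac{\partial r}{\partial a}\right)_k + \left(\frac{\partial f}{\partial a}\right)_k \left(\frac{\partial R^\pi}{\partial x}\right)_{k+1} \right) \qquad (15)$$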

This equation is identical to the weight update performed by BPTT. It is defined for a
general policy. We now switch to specifically consider PGL applied to a greedy policy.
Initially we only consider the case where $\partial R^\pi/\partial w$ exists for a greedy trajectory, and hence
$(\partial \pi/\partial w)_t$ exists for all $t$. Now in the summation of Eq. 15 we only need to consider the time
steps where $a_t$ is not saturated, since for $a_t$ saturated, $(\partial \pi/\partial w)_t = 0$ (by Lemma 2).

The summation involves terms $(\partial \pi/\partial w)_t$ and $(\partial r/\partial a)_t$ which can be reinterpreted under the
greedy policy:

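Lemma 3 The greedy policy implies, for an unsaturated action,

$$\left(\frac{\partial Q}{\partial a}\right)_t = \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t \left(\frac{\partial V}{\partial x}\right)_{t+1} = 0$$

$$\Rightarrow \left(\frac{\partial r}{\partial a}\right)_t = -\left(\frac{\partial f}{\partial a}\right)_t G_{t+1} \qquad (16)$$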



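Lemma 4 When $(\partial \pi/\partial w)_t$ exists for an unsaturated action $a_t$, the greedy policy implies $(\partial Q/\partial a)_t \equiv 0$;
therefore,

$$0 = \frac{\partial}{\partial w} \left( \frac{\partial Q(x_t, \pi(x_t, w), w)}{\partial a_t} \right) = \frac{\partial}{\partial w} \left( \frac{\partial Q(x_t, a_t, w)}{\partial a_t} \right) + \left(\frac{\partial \pi}{\partial w}\right)_t \frac{\partial}{\partial a_t} \left( \frac{\partial Q(x_t, a_t, w)}{\partial a_t} \right)$$

$$= \frac{\partial}{\partial w} \left( \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G(x_{t+1}, w) \right) + \left(\frac{\partial \pi}{\partial w}\right)_t \left(\frac{\partial^2 Q}{\partial a^2}\right)_t$$

$$= \left(\frac{\partial G}{\partial w}\right)_{t+1} \left(\frac{\partial f}{\partial a}\right)_t^T + \left(\frac{\partial \pi}{\partial w}\right)_t \left(\frac{\partial^2 Q}{\partial a^2}\right)_t$$

$$\Rightarrow \left(\frac{\partial \pi}{\partial w}\right)_t = -\left(\frac{\partial G}{\partial w}\right)_{t+1} \left(\frac{\partial f}{\partial a}\right)_t^T \left(\frac{\partial^2 Q}{\partial a^2}\right)_t^{-1} \qquad (17)$$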



It now becomes possible to analyse the PGL weight update with a greedy policy.
Substituting the results of the above lemmas (Eq. 16 and Eq. 17), and $(\partial R^\pi/\partial x)_t = G'_t$ with
$\lambda = 1$ (see Eq. 6), into Eq. 15 gives:
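
$$\Delta w = \alpha \sum_{t \ge 0} \left(\frac{\partial G}{\partial w}\right)_{t+1} \Omega_t (G'_{t+1} - G_{t+1}) \qquad (18)$$

where

$$\Omega_t = -\left(\frac{\partial f}{\partial a}\right)_t^T \left(\frac{\partial^2 Q}{\partial a^2}\right)_t^{-1} \left(\frac{\partial f}{\partial a}\right)_t \qquad (19)$$

and is positive semi-definite, by the greedy policy (Lemma 1).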
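
A small numerical sketch can make the "carefully chosen matrix" concrete. It assumes that the matrix has the form $\Omega_t = -(\partial f/\partial a)_t^T (\partial^2 Q/\partial a^2)_t^{-1} (\partial f/\partial a)_t$, which is what substituting Lemmas 3 and 4 into Eq. 15 produces, and that $(\partial f/\partial a)_t$ is stored as an array of shape `(n_a, n_x)`. The function name and the example numbers are hypothetical.

```python
import numpy as np

def omega_matrix(df_da, d2Q_da2):
    """Omega = -(df/da)^T (d2Q/da2)^{-1} (df/da).

    Assumed shapes: df_da is (n_a, n_x), d2Q_da2 is (n_a, n_a).
    For an unsaturated greedy action d2Q_da2 is negative definite,
    which makes the result positive semi-definite.
    """
    # Solve instead of forming the inverse explicitly.
    return -df_da.T @ np.linalg.solve(d2Q_da2, df_da)

# One action dimension, two state dimensions (illustrative numbers).
omega = omega_matrix(np.array([[1.0, 2.0]]), np.array([[-2.0]]))
eigenvalues = np.linalg.eigvalsh(omega)

# One-step Toy Problem of Experiment 1: df/da = 1 and d2Q/da2 = -2.
omega_toy = omega_matrix(np.array([[1.0]]), np.array([[-2.0]]))
print(omega, eigenvalues, omega_toy)
```

In the first example the matrix has rank one, so it is positive semi-definite but not strictly positive definite; this is the situation in which a full-rank choice of $\Omega_t$ matters if bootstrapping is used.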

Equation 18 is identical to a VGL weight update equation (Eq. 14), with a carefully
chosen matrix for $\Omega_t$, and $\lambda = 1$, provided $(\partial \pi/\partial w)_t$ and $(\partial^2 Q/\partial a^2)_t^{-1}$ exist for all $t$.
If $(\partial \pi/\partial w)_t$ does not exist, then $\partial R^\pi/\partial w$ is not defined either. Resolutions to these
existence conditions are proposed at the end of this section.

This completes the demonstration of the equivalence of a VGL algorithm (with the
conditions stated above) to PGL (with greedy policy; when $\partial R^\pi/\partial w$ exists). Unfortunately we
could not find a similar analysis for $\lambda < 1$, and divergence examples in this case are given in
Section 4.3.

This result for $\lambda = 1$ was quite a surprise. It justifies the omission of the "residual
gradient terms" when forming the weight update equation (Eq. 14). Omitting these residual
gradient terms is not, as it may have seemed, a puzzling modification to $\partial E/\partial w$; it is really
$\partial R^\pi/\partial w$ (with $\lambda = 1$, and the given $\Omega_t$). This means using the $\Omega_t$ terms ensures an optimal
value of $R^\pi$ is obtained, as shown in the experiment in Section 4.4. Also, it shows that
VGL algorithms (and hence value function learning algorithms) are not that different from
PGL algorithms after all. It was not known that a PGL weight update, when applied to
a greedy policy on a value function, would be doing the same thing as a value function
weight update, even if both had $\lambda = 1$. Of course they are usually not the same, unless this
particular choice of $\Omega_t$ is made.

This also provides a tentative justification for the TD($\lambda$) weight update equation (Eq.
10). From the point of view of the author, this previously had no theoretical justification. It
was seemingly chosen because it looks a bit like gradient descent on an error function, and
the Bellman Equation happens to be a fixed point of it. This has been a hugely puzzling
issue. There are no convergence guarantees for it and numerous divergence examples (in
Section 4.3 we show it can even diverge with $\lambda = 1$). Our explanation for it is that it is a
stochastic approximation to Eq. 14, which itself is an approximation to PGL when $\lambda = 1$.

Also it is our understanding that this is a particularly good form of VGL weight update
to make, since it has good convergence guarantees. If an alternative is chosen, e.g. by
replacing $\Omega_t$ by the identity matrix, then it might be possible to get much more aggressive
learning.³ TD($\lambda$), being a stochastic approximation to Eq. 14, is fixed to implicitly use an
identity matrix for $\Omega_t$. But this creates the unwanted problem of non-monotonic progress,
in the same way that any aggressive modification to gradient ascent may do. It is also
possible to get divergence in this case (see Section 4.3). It is our opinion that it is better
to use a more theoretically justifiable acceleration method such as conjugate-gradients or
RPROP.

This equivalence reduces the possible advantages of the value function architecture, in
the case of $\lambda = 1$, down to being solely a sophisticated implementation of a policy function.
This sophisticated policy architecture may just be easier to train than other policies, just
as some neural-network architectures are easier to train than others. It is not the actual
learning algorithm that is delivering any benefits.

We note that it also creates a difficulty in distinguishing between PGL and VGL. With
$\lambda = 1$ and $\Omega_t$ as defined in Eq. 19, the equation can no longer be claimed as a new learning
algorithm, since it is the same as BPTT with a greedy policy. Therefore the experimental
results will be exactly the same as for BPTT. However, we will describe the above weight
update as a VGL update; it is of the same form as Eq. 14. We also point out that forms of
Eq. 14 came first (see Werbos, 1998), before the connection to PGL was realised, and that
it itself is an idealised form of the TD($\lambda$) weight update.

This equivalence proof is almost a convergence proof for Eq. 18 with $\lambda = 1$, since for the
majority of learning iterations there is smooth gradient ascent on $R^\pi$. The problem is that
sometimes the terms $(\partial \pi/\partial w)_t$ are not defined and then learning progress jumps discontinuously.
One solution to this problem could be to choose a function approximator for $V$ such that
the function $Q(x,a,w)$ is everywhere strictly concave with respect to $a$, as is done in the
Toy Problem experiments of Section 4. A more general solution is given in the next section.

Both of these solutions also satisfy the requirement that $(\partial^2 Q/\partial a^2)_t^{-1}$ exists for all $t$.

2.2 Continuous-Time Formulation

In many problems time can be treated as a continuous variable, i.e. $t \in [0, F]$, as considered
by Doya (2000) and Baird (1994). With continuous-time formulations, some extra difficulties
can arise for VL as described and solved by Baird (1994), but these do not apply to
VGL, for reasons described further below. We describe a continuous-time formulation here,
since, in some circumstances, the greedy policy $\bar\pi(x, w)$ becomes a smooth function of $x$ and
$w$. This removes the problem of undefined $(\partial \pi/\partial w)_t$ terms that was described in the previous
section, and leads to a VGL algorithm for control problems with a function approximator
that is guaranteed to converge.

We use bars over the previously defined functions to denote their continuous-time
counterparts, so that $\bar r(x,a)$ and $\bar f(x,a)$ denote the continuous-time model functions. The
trajectory is generated from a given start point $x_0$ by the differential equation (DE),

$$\frac{dx_t}{dt} = \bar f(x_t, \bar\pi(x_t, w)) \qquad (20)$$



3. In fact, in the continuous-time formulation of Eq. 23, $(\partial^2 \bar Q/\partial a^2)_t^{-1} = -g'\left( (\partial \bar r^L/\partial a)_t + (\partial \bar f/\partial a)_t G_t \right)$, and so
setting $\Omega_t$ to the identity matrix is analogous to giving the derivative of a sigmoid function an artificial
boost (see Eq. 19). This is like the trick proposed by Fahlman (1988) that is sometimes used to speed
up supervised learning in artificial neural networks, but at the expense of robustness.

The total reward for this trajectory is



$$R^\pi(x_0, w) = \int_0^F \bar r(x_t, a_t)\, dt$$

The continuous-time Q function is

$$\bar Q(x, a, w) = \bar r(x, a) + \bar f^T(x, a)\, G(x, w) + V(x, w) \qquad (21)$$



This is closely related to the "advantage function" (Baird, 1994), that is $A(x, a, w) =
\bar Q(x, a, w) - V(x, w)$. The major difference of the VGL approach over advantage functions is
that "advantage learning" only learns the component of $G$ that is parallel to the trajectory,
and so it is similar to all VL algorithms in that constant exploration of the neighbouring
trajectories must take place. Also, as pointed out by Doya (2000), the problem of
indistinguishable Q values that advantage-learning is designed to address is not relevant when
using the following policy.

We use the same policy as proposed by Doya (2000). The greedy policy does not need to
look ahead in the continuous-time formulation, and instead relies only on the value-gradient
at the current time. We assume the model functions are linear in $a$ (which is common in
Newtonian models, since acceleration is proportional to the force), and then we introduce
an extra "action-cost" non-linear term, $\bar r^C(x, a)$, into the model's reward function, giving

$$\bar r(x, a) = \bar r^L(x, a) + \bar r^C(x, a) \qquad (22)$$

where $\bar r^L(x,a)$ is the original linear reward function. The action-cost term has the effect
of ensuring the action chosen by the greedy policy is bound to $[-1,1]$, and also that the
actions $a_t$ are smooth functions of $G_t$. A suitable choice is $\bar r^C(x,a) = -\int_0^a g^{-1}(x)\,dx$ where
$g^{-1}$ is the inverse of $g(x) = \tanh(x/c)$, and $c$ is a positive constant. This idea is explained
more fully by Doya (2000).

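Using this choice of $\bar r^C(x, a)$, and substituting Eq. 21 and Eq. 22 into the greedy policy
gives:

$$a_t = \bar\pi(x_t, w) = g\left( \left(\frac{\partial \bar r^L}{\partial a}\right)_t + \left(\frac{\partial \bar f}{\partial a}\right)_t G_t \right)$$

and

$$\left(\frac{\partial \bar\pi}{\partial w}\right)_t = g'\left( \left(\frac{\partial \bar r^L}{\partial a}\right)_t + \left(\frac{\partial \bar f}{\partial a}\right)_t G_t \right) \left(\frac{\partial G}{\partial w}\right)_t \left(\frac{\partial \bar f}{\partial a}\right)_t^T = \frac{1 - a_t^2}{c} \left(\frac{\partial G}{\partial w}\right)_t \left(\frac{\partial \bar f}{\partial a}\right)_t^T \qquad (23)$$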



For this policy, as $c \to 0$, the policy tends to "bang-bang" control. For $c > 0$, this
policy function meets the objective of producing bound actions that are smooth functions
of $G_t$, and therefore, since the function approximator is assumed smooth, are also smooth
functions of $w$. This solves the problem of discontinuities described in the previous section.
The solution works for any $c > 0$, so can get arbitrarily close to the ideal of bang-bang
control.
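
The behaviour of this policy for different values of $c$ can be sketched numerically. The code below assumes a single action dimension and takes the policy to be $a = g(\partial \bar r^L/\partial a + (\partial \bar f/\partial a) G)$ with $g(x) = \tanh(x/c)$, which is what maximising Eq. 21 with the action-cost term gives; the function names and the example numbers are hypothetical.

```python
import numpy as np

def greedy_action(dr_da, df_da, G, c):
    """Continuous-time greedy action for one action dimension.

    Assumed form: a = tanh((dr^L/da + (df/da) . G) / c), with
    df_da and G both arrays of shape (n_x,).
    """
    return np.tanh((dr_da + df_da @ G) / c)

def action_slope(a, c):
    """Derivative g'(u) written in terms of the action itself: (1 - a^2) / c."""
    return (1.0 - a ** 2) / c

df_da = np.array([1.0, 0.5])
G = np.array([0.2, 0.2])          # argument of g is 0.0 + 0.3 = 0.3

a_smooth = greedy_action(0.0, df_da, G, c=0.5)    # smooth, inside (-1, 1)
a_sharp = greedy_action(0.0, df_da, G, c=1e-3)    # close to bang-bang

# Finite-difference check of the slope formula at u = 0.3, c = 0.5.
h = 1e-6
numeric = (np.tanh((0.3 + h) / 0.5) - np.tanh((0.3 - h) / 0.5)) / (2 * h)
analytic = action_slope(a_smooth, 0.5)
print(a_smooth, a_sharp, numeric, analytic)
```

For small $c$ the action saturates at the bound, and the slope $(1 - a^2)/c$ evaluated there becomes very small, which is why the weight updates shrink in that regime.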

Using this policy, the trajectory can be calculated via Eq. 20 using a suitable DE solver.
For small $c$, the actions can rapidly alternate and so the DE may be "stiff" and need an
appropriate solver (see Press et al., 1992, ch. 16.6).



For the learning equations in continuous-time, we use $\bar\lambda$ as the 'bootstrapping'
parameter. This is related to the discrete time $\lambda$ by $e^{-\bar\lambda \Delta t} = \lambda$ where $\Delta t$ is the discrete-time
time-step. This means $\bar\lambda = 0$ gives no bootstrapping, and that bootstrapping increases as
$\bar\lambda \to \infty$, i.e. this is the opposite way around to the discrete-time parameter $\lambda$.

The equations in the rest of this section were derived in a similar manner to the
discrete-time case, and by letting the discrete time-step $\Delta t \to 0$.

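The continuous-time target-value has several different equivalent formulations:

$$V'_{t_0} = \int_{t_0}^{F} e^{-\bar\lambda (t - t_0)} \left( \bar r_t + \frac{dV_t}{dt} \right) dt + V_{t_0} = \int_{t_0}^{F} e^{-\bar\lambda (t - t_0)} \left( \bar r_t + \bar\lambda V_t \right) dt$$

$$\iff \frac{dV'_t}{dt} = -\bar r_t + \bar\lambda (V'_t - V_t) \quad \text{with boundary condition } V'_F = 0$$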
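
As a sanity check on these formulations, the differential form can be integrated backwards from the terminal condition and compared with the integral form. The sketch below assumes the differential form $dV'/dt = -\bar r + \bar\lambda (V' - V)$ with $V'_F = 0$ and uses a plain explicit Euler step; the function name, the constant reward and the zero value function in the usage example are hypothetical choices.

```python
import math

def target_value_at_start(r, V, lam_bar, F, dt):
    """Integrate dV'/dt = -r(t) + lam_bar * (V'(t) - V(t)) backwards in time
    from V'(F) = 0 with explicit Euler steps, and return V'(0).

    r and V are callables of time giving the reward rate and the value
    function's output along a fixed trajectory (illustrative interface).
    """
    steps = int(round(F / dt))
    v_target = 0.0
    for k in range(steps, 0, -1):
        t = k * dt
        dv_dt = -r(t) + lam_bar * (v_target - V(t))
        v_target -= dt * dv_dt        # step from time t back to t - dt
    return v_target

# Constant reward rate 1 and V = 0 everywhere, for which the integral form
# gives V'_0 = (1 - exp(-lam_bar * F)) / lam_bar.
F, lam_bar = 2.0, 1.5
numeric = target_value_at_start(lambda t: 1.0, lambda t: 0.0, lam_bar, F, dt=1e-3)
closed_form = (1.0 - math.exp(-lam_bar * F)) / lam_bar

# With no bootstrapping (lam_bar = 0) the target is just the total reward.
no_bootstrap = target_value_at_start(lambda t: 1.0, lambda t: 0.0, 0.0, F, dt=1e-3)
print(numeric, closed_form, no_bootstrap)
```

A stiff problem would call for a more careful integrator than the Euler step used here, which is only meant to show that the two formulations agree.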

The target value-gradient is given by:

$$\frac{dG'_t}{dt} = -\left(\frac{D\bar r}{Dx}\right)_t - \left(\frac{D\bar f}{Dx}\right)_t G'_t + \bar\lambda (G'_t - G_t) \qquad (24)$$

with boundary condition $G'_F = 0$, and where $\frac{D}{Dx} \equiv \frac{\partial}{\partial x} + \frac{\partial \bar\pi}{\partial x} \frac{\partial}{\partial a}$. This is a DE that needs
solving with the same care as the trajectory DE, and it may also be stiff. Note
that in this equation, $\bar r$ is the full $\bar r$ as defined in Eq. 22. Also, in the case of an episodic
problem where a final impulse of reward is given, an alternative boundary condition to this
one may be required; see Appendix E for a discussion and example.

For this policy, model and $\bar\lambda = 0$, the PGL weight update is:

$$\Delta w = \alpha \int_0^F \left(\frac{\partial G}{\partial w}\right)_t \Omega_t (G'_t - G_t)\, dt \qquad (25)$$

where $\Omega_t = \frac{1 - a_t^2}{c} \left(\frac{\partial \bar f}{\partial a}\right)_t^T \left(\frac{\partial \bar f}{\partial a}\right)_t$ and is positive semi-definite.

This integral is the exact equation for gradient ascent on $R^\pi$. Therefore, if implemented
precisely, termination will occur at a constrained (with respect to $w$) locally optimal
trajectory (see Appendix A for optimality definitions). However, numerical methods are required
to evaluate this integral and the other DEs in this section. For example, the above integral
is most simply approximated as:

$$\Delta w = \alpha \sum_{t=0}^{F} \left(\frac{\partial G}{\partial w}\right)_t \Omega_t (G'_t - G_t)\, \Delta t$$

which is very similar to the discrete-time case (Eq. 18).

The fact that this algorithm is gradient ascent means it can be speeded up with any
of the fast optimisers available, e.g. RPROP (Riedmiller and Braun), which becomes
very useful when $c$ is small and therefore $\Delta w$ becomes very small.

Eq. 25 was derived for $\bar\lambda = 0$, although it can, in theory, be applied to other $\bar\lambda$. In this
case, it is thought to be necessary to choose a full-rank version of $\Omega_t$. However our results
with bootstrapping with a function approximator are not good (see Section 4.5).

2.3 Residual Gradients 

In this section we derive the formulae for full gradient descent on the value-gradient error
function, according to Eq. 12 and Eq. 13. The particularly promising thing about this
approach is that it has good convergence guarantees for any $\lambda$. This kind of full gradient
descent method is known as residual gradients by Baird (1995) or as using the Galerkinized
equations by Werbos (1998).



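To calculate the total derivative of $E$, it is easier to first write $E$ in recursive form:

$$E(x_t, w) = \frac{1}{2} (G_t - G'_t)^T \Omega_t (G_t - G'_t) + E(x_{t+1}, w)$$

with $E(x_F, w) = 0$ at the terminal time step. This gives a total derivative:

$$\left(\frac{\partial E}{\partial w}\right)_t = \left( \left(\frac{\partial G}{\partial w}\right)_t - \left(\frac{\partial G'}{\partial w}\right)_t \right) \Omega_t (G_t - G'_t) + \left(\frac{\partial \pi}{\partial w}\right)_t \left(\frac{\partial f}{\partial a}\right)_t \left(\frac{\partial E}{\partial x}\right)_{t+1} + \left(\frac{\partial E}{\partial w}\right)_{t+1}$$

with $(\partial E/\partial w)_F = 0$ and where $(\partial E/\partial x)_{t+1}$ is found recursively by

$$\left(\frac{\partial E}{\partial x}\right)_t = \left( \left(\frac{\partial G}{\partial x}\right)_t - \left(\frac{\partial G'}{\partial x}\right)_t \right) \Omega_t (G_t - G'_t) + \left(\frac{Df}{Dx}\right)_t \left(\frac{\partial E}{\partial x}\right)_{t+1}$$

with $(\partial E/\partial x)_F = 0$. Note that this goes further than doing residual-gradients for VL in that
there is a consideration for $(\partial \pi/\partial w)_t$. This is necessary for true gradient-descent on $E$ with respect
to the weights, since in this paper we say the value function and greedy policy are tightly
coupled, and therefore updating $w$ will immediately change the trajectory. This can be
verified by evaluating $\partial E/\partial w$ numerically.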

Although this weight update performs relatively well in the experiments of Section 4,
our general experience of this algorithm is that it often gets stuck in far-from-optimal local
minima. This was quite puzzling and was not explained by previous criticisms of residual
gradients (for example, see Baird, 1995), since these only applied to stochastic scenarios.
It seems that many of the local minima of $E$ are not necessarily valid local maxima of
$R^\pi$. We speculate that choosing to include the residual gradient terms is analogous to
choosing to maximise a function $f(x)$ by gradient descent on $(f'(x))^2$. This makes it difficult
to distinguish the maxima from the unwanted minima, inflections and saddle points, and
although the situations are not identical, an effect like this may be reducing the effectiveness
of the residual gradients weight update. This contrasts with the non-residual gradients
approach (Eq. 14) where $\Delta w = \alpha\, \partial R^\pi/\partial w$, which is analogous to maximising $f(x)$ directly (if
$\lambda = 1$; see Section 2.1). By the arguments of Section 1.4 we would expect this explanation
to apply to VL with residual gradients too.

To illustrate this problem by a specific example, consider a variant of the 1-step Toy
Problem with $k = 0$, modified so that the final reward is $R(x_1) = -x_1^2 + 4\cos(x_1)$ instead
of the usual $R(x_1) = -x_1^2$. Then the optimal policy is $\pi^*(x_0) = -x_0$. Let the function

Figure 5: Graphs showing the spurious minima traps that can exist for Residual Gradient
methods compared to direct methods. 



approximator be $V(x_1, w) = -x_1^2 + w x_1$, so that the greedy policy with this model gives
$\pi(x_0, w) = w/2 - x_0$. This gives $R^\pi(x_0, w) = -w^2/4 + 4\cos(w/2)$ which has just one
maximum, at $w = 0$, corresponding to the optimal policy. With $x_1 = w/2$ we have $G_1 = 0$ and
$G'_1 = -w - 4\sin(w/2)$, so the residual value-gradient
error is $E(x_0, w) = \frac{1}{2}(G_1 - G'_1)^2 = \frac{1}{2}(w + 4\sin(w/2))^2$ which has many local minima (see
Figure 5), only one of which corresponds to the optimal policy. The spurious minima in
$E$ correspond to points of inflection in $R^\pi$, of which there are infinitely many. Therefore
gradient descent on $E$ is more likely than not to converge to an incorrect solution, whereas
gradient ascent on $R^\pi$ will converge to the correct solution.
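
The trap can be reproduced with a few lines of code. The sketch below works directly from the definitions in this example ($x_1 = w/2$, $G_1 = \partial V/\partial x_1$, $G'_1 = dR/dx_1$); the starting weight $w = 9$, the step sizes and the iteration count are arbitrary choices for illustration.

```python
import math

def d_reward(w):
    """dR^pi/dw for R^pi(w) = -w^2/4 + 4 cos(w/2)."""
    return -w / 2 - 2 * math.sin(w / 2)

def d_error(w):
    """dE/dw for E = 0.5 * (G_1 - G'_1)^2, built from the definitions."""
    x1 = w / 2                               # greedy policy lands at x1 = w/2
    g = -2 * x1 + w                          # G_1 = dV/dx1 (zero on the greedy trajectory)
    g_target = -2 * x1 - 4 * math.sin(x1)    # G'_1 = dR/dx1
    residual = g - g_target                  # = w + 4 sin(w/2)
    d_residual = 1 + 2 * math.cos(w / 2)     # derivative of the residual w.r.t. w
    return residual * d_residual

def iterate(w, step, n=5000):
    for _ in range(n):
        w += step(w)
    return w

# Same starting weight for both methods.
w_ascent = iterate(9.0, lambda w: 0.1 * d_reward(w))     # gradient ascent on R^pi
w_descent = iterate(9.0, lambda w: -0.01 * d_error(w))   # gradient descent on E
print(w_ascent, w_descent)
```

From this start, ascent on $R^\pi$ reaches the optimum at $w = 0$, while descent on $E$ settles where $1 + 2\cos(w/2) = 0$, i.e. near $w = 8\pi/3$, which is a point of inflection of $R^\pi$ rather than a maximum.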



3. Actor-Critic Architectures 



This section discusses the use of actor-critic architectures with VGL. It shows that in some 
circumstances the actor-critic architecture can be shown to be equivalent to a simpler ar- 
chitecture. While this can be used to provide convergence guarantees for the actor-critic 
architecture, it also makes the actor-critic architecture redundant in these circumstances. 



An actor-critic architecture (Barto et al., 1983; Werbos, 1998; Konda and Tsitsiklis,
2003) uses two neural-networks, or more generally two function approximators, in a control
problem. The first neural-network, parametrised by weight vector $z$, is the "actor" which
provides the policy, $\pi(x, z)$. In an actor-critic architecture the greedy policy is not used,
since the actor neural-network is the policy. The second neural-network, parametrised by a
weight vector $w$, is the "critic" and provides the value function $V(x,w)$.

For this section we extend the definition of $V'$ and $G'$ to apply to trajectories found by
policies other than the greedy policy. To define $V'$ for an arbitrary policy $\pi(x, z)$ and value
function $V(x, w)$ we use:

$$V'(x, w, z) = r(x, \pi(x, z)) + \lambda V'(f(x, \pi(x, z)), w, z) + (1 - \lambda) V(f(x, \pi(x, z)), w) \qquad (26)$$

with $V'(x_F, w, z) = 0$. This gives $G'(x, w, z) = \frac{\partial V'(x, w, z)}{\partial x}$ and for a given trajectory we can
use the shorthand $V'_t = V'(x_t, w, z)$ and $G'_t = G'(x_t, w, z)$.

The value-gradient version of the actor training equation is as follows (Werbos, 1998,
Eq. 11):

$$\Delta z = \alpha \sum_{t \ge 0} \left(\frac{\partial \pi}{\partial z}\right)_t \left( \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} \right) \qquad (27)$$

where $\alpha$ is a small positive learning-rate and $G_{t+1}$ is calculated by the critic. This equation is
almost identical to the PGL equation (BPTT, Eq. 15) except that $G_{t+1}$ has been substituted
for $(\partial R^\pi/\partial x)_{t+1}$. Eq. 27 is a non-standard actor training equation; however, in Appendix F we
prove it is equivalent to at least one other actor training equation and demonstrate how it
automatically incorporates exploration.
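
A minimal sketch of how this actor update could be accumulated over a stored trajectory is given below. It assumes the update has the form $\Delta z = \alpha \sum_t (\partial \pi/\partial z)_t \left( (\partial r/\partial a)_t + (\partial f/\partial a)_t G_{t+1} \right)$, i.e. the PGL sum with the critic's $G_{t+1}$ in place of $(\partial R^\pi/\partial x)_{t+1}$; the function name and array layouts are hypothetical.

```python
import numpy as np

def actor_update(dpi_dz, dr_da, df_da, G_next, alpha):
    """Accumulate the value-gradient actor update over one trajectory.

    Assumed array layout (for illustration only):
      dpi_dz : (T, n_z, n_a), entry [t, i, j] = d pi_j / d z_i
      dr_da  : (T, n_a)
      df_da  : (T, n_a, n_x), entry [t, i, j] = d f_j / d a_i
      G_next : (T, n_x), the critic's value-gradient at step t + 1
    """
    # Gradient of Q with respect to the action at every time step.
    dQ_da = dr_da + np.einsum('tax,tx->ta', df_da, G_next)
    return alpha * np.einsum('tza,ta->z', dpi_dz, dQ_da)

dpi_dz = np.array([[[1.0], [2.0]]])      # T = 1, n_z = 2, n_a = 1
dr_da = np.array([[0.5]])
df_da = np.array([[[1.0]]])              # n_x = 1

# The critic's gradient exactly cancels dr/da: the action is already greedy.
dz_greedy = actor_update(dpi_dz, dr_da, df_da, np.array([[-0.5]]), alpha=0.1)
# Otherwise the actor is pushed in the direction that increases Q.
dz_other = actor_update(dpi_dz, dr_da, df_da, np.array([[1.5]]), alpha=0.1)
print(dz_greedy, dz_other)
```

The update vanishes exactly when $\partial Q/\partial a = 0$ at every step, which is the condition satisfied by an unsaturated greedy action.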

The critic training equation would attempt to attain $G_t = G'_t$ for all $t$ by some
appropriate method, for example by Eq. 14, which is the only update we consider here.

In this section, it is useful to define the notion of a hypothetical ideal function
approximator. This is a function approximator that can be assumed to have enough degrees of
freedom, and a strong enough learning algorithm, to attain its desired objectives exactly,
e.g. $G_t = G'_t$ for all $t$, exactly. We also refer to an ideal critic and an ideal actor, which are
based on ideal function approximators.

Training an actor and a critic together gives several possibilities of implementation; 
either one could be fixed while training the other, or they could both be trained at the 
same time. We only analyse the situations of keeping one fixed while training the other. 
Doing a long iterative process (training one) within another long iterative process (training 
the other) is very bad for efficiency, which may make the cases analysed seem infeasible 
and therefore of little relevance. However, the analyses below show both of these situations 
have equivalent architectures that are efficient and feasible to implement, and therefore are 
relevant. 

It is noted that the results in this section should also apply to VL actor-critic systems,
since as discussed in Section 1.4, VL is a stochastic approximation to VGL.

3.1 Keeping the Actor Fixed while training the Critic 

In this scenario we keep the actor fixed while training an ideal critic fully to convergence,
and then apply one iteration of actor training, and then repeat until both have converged.
It is shown that, if the critic is ideal, then this scenario is equivalent to back-propagation
through time (BPTT) for any $\lambda$.

Since the actor is fixed, the trajectory is fixed; therefore an ideal critic will be able to
attain the objective of $G_t = G'_t$ for all $t$ along the trajectory. Then since $G_t = G'_t$ for all
$t$, we have $G_t = (\partial R^\pi/\partial x)_t$ for all $t$ (the proof is in the appendix, Lemma 5). Therefore the actor's
weight update equation (Eq. 27) becomes identical to Eq. 15. Therefore we could omit
the critic and replace $G_t$ with $(\partial R^\pi/\partial x)_t$ in the actor training equation (i.e. we remove the inner
iterative process, and remove the actor-critic architecture), and we are left with BPTT.
This shows that this actor-critic is guaranteed to converge, since it is the same as BPTT.

The above argument assumed the critic was ideal. This may be an unrealistic assumption
since a real function approximator can have only finite flexibility. However, the objective
of the function approximator is to learn $G_t = G'_t$ for all $t$, and this goal can be achieved
exactly, simply by removing the critic. In effect, by removing the critic, a virtual ideal
function approximator is obtained. It is assumed that there would be no advantage in using
a non-ideal critic.

Conclusion: BPTT is equivalent to the idealised version of this architecture, and there- 
fore the idealised version of this architecture is guaranteed to converge to a constrained 
locally optimal trajectory (see Appendix A for optimality definitions). The idealised
version of this architecture is attainable by removing the critic.

3.2 Keeping the Critic Fixed while training the Actor 

In this scenario we consider keeping the critic fixed while training an ideal actor fully to 
convergence, and then apply one iteration of training to the critic and repeat. 

The actor weight update equation (Eq. 27) tries to maximise $Q(x_t, a_t, w)$ with respect
to $a_t$ at each time step $t$. If this is fully achieved, then the greedy policy will be satisfied.
Therefore the ideal actor can be removed and replaced by the greedy policy, to get the same
algorithm. This again removes the innermost iterative step, and removes the actor-critic
architecture. Again, it is assumed that there would be no advantage in using a non-ideal
actor.

Now when it comes to the critic-training step (Eq. 14), do we allow for the fact that
changing the critic weights is going to change the actor in a predictable way? If so, then
we treat $\Omega_t$ as defined in Eq. 19. Otherwise we are working as if the actor and critic are
fully separated, and we are free to choose $\Omega_t$. Having made this choice and substitution,
the actor is redundant.

Conclusion: Keeping the critic fixed while training an idealised actor is equivalent to 
using just the critic with a greedy policy. This idealised architecture can be efficiently 
attained by removing the actor. 

4. Experiments 

In this section a comparison is made between the performance of several weight update
strategies on various tasks. The weight update formulae considered are summarised in
Table 1.

The first few experiments are all based on the n-step Toy Problem defined in Section
1.1.1. This problem domain was chosen because it is smooth, deterministic and
concave, and because it makes the experiments easy to describe and reproduce. Within
this choice, it makes a level playing field for comparison between VL and VGL algorithms.

The final experiment is a neural-network based experiment on a Lunar-Lander problem
specified in Appendix E. This is a new benchmark problem defined for this paper; the
problem with most existing benchmark problems is that they are only defined for
continuing tasks, discrete state-space tasks or tasks not well suited to local exploration.

In the Toy Problem experiments, all weight components were initialised with a uniform
random distribution over a range from $-10$ to $+10$. The experiments were based on 1000
trials. The stopping criteria used for each trial were as follows: A trial was considered a
success when $|w_i - w_i^*| < 10^{-7}$ for all components $i$. A trial was considered a failure if
any component of $w$ became too large for the computer's accuracy, or when the number of
iterations in any trial exceeded $10^7$. A function RND($\epsilon$) is defined to return a normally
distributed random variable with mean 0 and standard deviation $\epsilon$. The variables $c_1$, $c_2$,
$c_3$ are real constants specific to each experiment designed to allow further variation in the
problem specifications.

| Weight Update Formula | Abbreviation |
|---|---|
| Value-Learning Update, TD($\lambda$) (Eq. 10). | VL($\lambda$) |
| Value-Gradient Learning Update (Eq. 18), using full $\Omega_t$ matrix from Eq. 19. | VGL$\Omega$($\lambda$) |
| Value-Gradient Learning Update (Eq. 14), but using $\Omega_t$ as the identity matrix for all $t$. | VGL($\lambda$) |
| Value-Gradient Learning Update, residual gradients (Eq. 13), but using $\Omega_t$ as the identity matrix for all $t$. | VGLRG($\lambda$) |

Table 1: Weight update formulae considered.



4.1 Experiment 1: One-step Toy Problem 

In this experiment the one-step Toy Problem with $k = 0$ was considered from a fixed start
point of $x_0 = 0$. A function approximator for $V$ with just two parameters $(w_1, w_2)$ was
used, and defined separately⁴ for the two time steps:

$$V(x_t, w_1, w_2) = \begin{cases} -(x_1 - c_1)^2 + w_1 x_1 + w_2 & \text{if } t = 1 \\ 0 & \text{if } t = 2 \end{cases}$$

where $c_1$ is a real constant.

Using this model and function approximator definition, it is possible to calculate the
functions $Q(x, a, w)$, $G(x, w)$ and the $\epsilon$-greedy policy $\pi(x, w)$ (which again must be defined
differently for each time step). These functions are listed in the left-hand column of Table
2. Using these formulae, and the model functions again, the full trajectory is calculated in
the right-hand column of Table 2. Also in this right-hand column, $V'$, $G'$ and $\Omega$ have been
calculated for each time step using Eq. 5, Eq. 6 and Eq. 19 respectively. These formulae
would have to be evaluated sequentially from top to bottom.

The $\epsilon$-greedy policy was necessary for the VL experiments. When applying the weight
update formulae, expressions for $\partial V/\partial w$ and $\partial G/\partial w$ were calculated analytically from the functions
given in the left column of Table 2. For example, for this function approximator we find
$(\partial G/\partial w_1)_1 = 1$ and $(\partial G/\partial w_2)_1 = 0$.

Note that as $w_2$ does not affect the trajectory, this component was not used as part of
the stopping criteria.

Results for these experiments using the VL($\lambda$) and VGL($\lambda$) algorithms are shown in
Table 3. This set of experiments verifies the failure of VL when exploration is removed;
the slowing down of VL when the exploration level $\epsilon$ is too low; and the blowing-up of VL when $\epsilon$ is too high
(in this case failure tended to occur because the size of weights exceeded the computer's
range). The efficiency and success rate of the VGL experiments is much better than for the
VL experiments, and this is true for both values of $c_1$ tested. In fact, the problem is trivially
easy for the VGL($\lambda$) algorithm, but causes the VL($\lambda$) algorithm considerable problems.
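
The VGL side of this experiment is simple enough to sketch in a few lines. The code below follows the sequential trajectory equations of Table 2 with exploration switched off ($\epsilon = 0$) and applies the update of Eq. 14 with $\Omega_t$ as the identity; the function name, the particular starting weight and the use of a single trial (rather than 1000 random initialisations) are illustrative assumptions.

```python
def vgl_experiment1(c1, alpha, w1, max_iters=10**7, tol=1e-7):
    """Run VGL (Eq. 14, identity Omega, no exploration) on the one-step Toy Problem.

    Returns the final weight w1 and the number of iterations used.
    Only w1 is tracked, since w2 does not affect the trajectory.
    """
    x0 = 0.0
    w1_star = -2.0 * c1                         # optimal weight
    for iteration in range(1, max_iters + 1):
        a0 = (2 * c1 - 2 * x0 + w1) / 2         # greedy action
        x1 = x0 + a0
        g_target = -2.0 * x1                    # G'_1
        g = 2 * (c1 - x1) + w1                  # G_1
        dg_dw1 = 1.0                            # dG_1 / dw_1
        w1 += alpha * dg_dw1 * (g_target - g)
        if abs(w1 - w1_star) < tol:
            return w1, iteration
    return w1, max_iters

w_fast, n_fast = vgl_experiment1(c1=10.0, alpha=1.0, w1=3.7)
w_slow, n_slow = vgl_experiment1(c1=0.0, alpha=0.1, w1=3.7)
print(w_fast, n_fast, w_slow, n_slow)
```

With $\alpha = 1.0$ the update lands on the optimal weight in a single iteration, and with $\alpha = 0.1$ the error shrinks by a constant factor per iteration, which is consistent with the pattern of iteration counts reported for VGL($\lambda$).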

To gain some further insight into the different behaviour of these two algorithms we can 
look at the differential equations that the weights obey. For the VGL(A) system of this 

4. See Section [LLT] for an explanation on this abuse of notation. 



Function approximator and $\epsilon$-greedy policy: 

Time step 1: 
$V(x_1, w_1, w_2) = -(x_1 - c_1)^2 + w_1 x_1 + w_2$ 
$\Rightarrow G(x_1, w_1, w_2) = 2(c_1 - x_1) + w_1$ 
$Q(x_0, a_0, w) = -(x_0 + a_0 - c_1)^2 + w_1 (x_0 + a_0) + w_2$ 
$\pi(x_0, w) = (2c_1 - 2x_0 + w_1)/2 + RND(\epsilon)$ 

Time step 2: 
$V(x_2, w_1, w_2) = 0$ 
$\Rightarrow G(x_2, w_1, w_2) = 0$ 

Sequential trajectory equations: 

$x_0 \leftarrow 0$ 
$a_0 \leftarrow (2c_1 - 2x_0 + w_1)/2 + RND(\epsilon)$ 
$x_1 \leftarrow x_0 + a_0$ 
$V'_1 \leftarrow -x_1^2$ 
$G'_1 \leftarrow -2x_1$ 
$V_1 \leftarrow -(x_1 - c_1)^2 + x_1 w_1 + w_2$ 
$G_1 \leftarrow 2(c_1 - x_1) + w_1$ 
$\Omega_0 \leftarrow \frac{1}{2}$ 

Optimal Policy: $\pi^*(x_0) = -x_0$ 

Optimal Weights: $w_1^* = -2c_1$ 

Table 2: Functions and Trajectory Variables for Experiment 1. 



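Results for algorithm VL($\lambda$) 

| $c_1$ | $\epsilon$ | Success rate ($\alpha = 0.01$) | Iterations mean | Iterations s.d. | Success rate ($\alpha = 0.1$) | Iterations mean | Iterations s.d. | Success rate ($\alpha = 1.0$) | Iterations mean | Iterations s.d. |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 66.4% | 1075.1 | 293.35 | 0.0% | - | - | 0.0% | - | - |
| 0 | 1 | 100.0% | 1715.8 | 343.31 | 87.6% | 163.52 | 31.948 | 3.8% | 134.86 | 59.643 |
| 0 | 0.1 | 100.0% | 172445 | 31007 | 89.5% | 17160 | 3033.6 | 16.5% | 1527.6 | 118.39 |
| 0 | 0 | 0.0% | - | - | 0.0% | - | - | 0.0% | - | - |
| 10 | 1 | 99.4% | 6048.5 | 270.86 | 0.0% | - | - | 0.0% | - | - |

Results for algorithm VGL($\lambda$) 

| $c_1$ | $\epsilon$ | Success rate ($\alpha = 0.01$) | Iterations mean | Iterations s.d. | Success rate ($\alpha = 0.1$) | Iterations mean | Iterations s.d. | Success rate ($\alpha = 1.0$) | Iterations mean | Iterations s.d. |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 100.0% | 1728.2 | 112.05 | 100.0% | 166.15 | 11.481 | 100.0% | 1 | 0 |
| 10 | 0 | 100.0% | 1898.5 | 51.986 | 100.0% | 181.59 | 14.861 | 100.0% | 1 | 0 |
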
Table 3: Results for Experiment 1. Note that because this is a 1-step problem, $\lambda$ is irrelevant. 

experiment, by going through the equations of the right-hand column of Table [2] and the 
VGL($\lambda$) weight update equations, we can eliminate all variables except for the weights and 
constants to obtain a self-contained pair of weight update equations: 

$$\begin{cases} \Delta w_1 = -\alpha (2c_1 + w_1) \\ \Delta w_2 = 0 \end{cases}$$

Taking $\alpha$ to be sufficiently small, these become a pair of coupled differential equations. The 
solution is a straight line across the $w$ plane directly to the optimal solution $w_1^* = -2c_1$. 

Doing the same for the VL($\lambda$) system, and integrating over the random variable $RND(\epsilon)$ 
to average out the effects of exploration, gives a similar pair of coupled weight update 
equations: 

$$\langle \Delta w_1 \rangle = -\alpha \left(c_1 + \frac{w_1}{2}\right) \left(2\epsilon^2 + c_1^2 + 2c_1 w_1 + \frac{w_1^2}{2} + w_2\right)$$

$$\langle \Delta w_2 \rangle = -\alpha \left(c_1^2 + 2c_1 w_1 + \frac{w_1^2}{2} + w_2\right)$$

There is no known full analytical solution to this pair of equations. However it is clear 
that the second equation is continually aiming to achieve $w_2 = -(c_1^2 + 2c_1 w_1 + w_1^2/2)$. 
In the case that this is achieved, both equations would then simplify to the VGL coupled 
equations, but with a magnitude proportional to $\epsilon^2$. This shows that if $\epsilon = 0$, the value-gradient 
part of these equations vanishes. It is also noted that in this case experiments show 
learning fails. Hence it is speculated that none of the other terms in the VL($\lambda$) coupled 
equations are doing anything beneficial, and that it is unlikely they will ever do so even in 
more complicated systems. Very informally, this example illustrates how VGL applies just 
the "important bits" of a VL weight update (in this example at least). 
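The two averaged weight-update systems for Experiment 1 can be iterated directly to see the behaviour described above. The following is a minimal sketch, not the code used for Table 3: the starting weights `w0`, the iteration count and the choice $c_1 = 0$ are assumptions made for illustration, and the VL update is the exploration-averaged form, so no random numbers are drawn.

```python
import numpy as np

def vgl_step(w, c1, alpha):
    """One VGL update for Experiment 1: only w1 changes."""
    w1, w2 = w
    return np.array([w1 - alpha * (2 * c1 + w1), w2])

def vl_mean_step(w, c1, alpha, eps):
    """One exploration-averaged VL update for Experiment 1."""
    w1, w2 = w
    common = c1**2 + 2 * c1 * w1 + 0.5 * w1**2 + w2
    dw1 = -alpha * (c1 + 0.5 * w1) * (2 * eps**2 + common)
    dw2 = -alpha * common
    return np.array([w1 + dw1, w2 + dw2])

def iterate(step, w0, n_iter, **kwargs):
    """Apply the given update rule n_iter times from w0."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        w = step(w, **kwargs)
    return w

# Hypothetical settings (not taken from the paper's experiments).
c1, alpha, n_iter = 0.0, 0.01, 5000
w0 = [1.0, 0.0]

w_vgl = iterate(vgl_step, w0, n_iter, c1=c1, alpha=alpha)
w_vl_greedy = iterate(vl_mean_step, w0, n_iter, c1=c1, alpha=alpha, eps=0.0)
w_vl_explore = iterate(vl_mean_step, w0, n_iter, c1=c1, alpha=alpha, eps=1.0)

print("optimal w1:          ", -2 * c1)
print("VGL w1:              ", w_vgl[0])
print("VL w1 (eps = 0):     ", w_vl_greedy[0])
print("VL w1 (eps = 1):     ", w_vl_explore[0])
```

With these settings the VGL system and the VL system with exploration both drive $w_1$ to $-2c_1$, whereas the VL system with $\epsilon = 0$ settles on the curve $w_2 = -(c_1^2 + 2c_1 w_1 + w_1^2/2)$ at a value of $w_1$ that depends on the starting weights.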

4.2 Experiment 2: Two-step Toy Problem, with Sufficiently Flexible Function 
Approximator 

In this experiment the two-step Toy Problem with $k = 1$ is considered from a fixed start 
point of $x_0 = 0$. A function approximator is defined differently at each time step, by four 
weights in total: 

$$V(x_t, w_1, w_2, w_3, w_4) = \begin{cases} -c_1 x_1^2 + w_1 x_1 + w_2 & \text{if } t = 1 \\ -c_2 x_2^2 + w_3 x_2 + w_4 & \text{if } t = 2 \\ 0 & \text{if } t = 3 \end{cases}$$

where $c_1$ and $c_2$ are real positive constants. The consequential functions and variables for 
this experiment are found and presented in a similar manner as for Experiment 1, in Table [4]. 

For ease of implementation of the residual-gradients algorithm, the expressions in the 
right-hand column of Table [4] for $G_t$ and $G'_t$ were used to implement a sum-of-squares error 
function $E(w)$ (Eq. [12]), with $\Omega_t = 1$. Numerical differentiation on this function was then 
used to implement the gradient descent. For a larger scale system, it would be more efficient 
and accurate to use the recursive equations given in Section [2.3]. 

Results for the experiments are given in Table [5]. These results show all VGL experiments 
performing significantly better than the corresponding VL experiments; in most cases by 
around two orders of magnitude. The results also show that for all of the VGL($\lambda$) algorithm 
results, increasing $\alpha$ from 0.01 to 0.1 brings the number of iterations down by a factor 
of approximately 10, which hints that further efficiency of the VGL algorithms could be 
attained. 

Function approximator and $\epsilon$-greedy policy: 

Time step 1: 
$V(x_1, w_1, w_2) = -c_1 x_1^2 + w_1 x_1 + w_2$ 
$\Rightarrow G(x_1, w_1, w_2) = -2c_1 x_1 + w_1$ 
$Q(x_0, a_0, w) = -k a_0^2 - c_1 (x_0 + a_0)^2 + w_1 (x_0 + a_0) + w_2$ 
$\pi(x_0, w) = \frac{w_1 - 2c_1 x_0}{2(c_1 + k)} + RND(\epsilon)$ 

Time step 2: 
$V(x_2, w_3, w_4) = -c_2 x_2^2 + w_3 x_2 + w_4$ 
$\Rightarrow G(x_2, w_3, w_4) = -2c_2 x_2 + w_3$ 
$Q(x_1, a_1, w) = -k a_1^2 - c_2 (x_1 + a_1)^2 + w_3 (x_1 + a_1) + w_4$ 
$\pi(x_1, w) = \frac{w_3 - 2c_2 x_1}{2(c_2 + k)} + RND(\epsilon)$ 

Time step 3: 
$V(x_3) = 0$ 
$\Rightarrow G(x_3) = 0$ 

Sequential trajectory equations: 

$x_0 \leftarrow 0$ 
$a_0 \leftarrow \frac{w_1 - 2c_1 x_0}{2(c_1 + k)} + RND(\epsilon)$ 
$x_1 \leftarrow x_0 + a_0$ 
$a_1 \leftarrow \frac{w_3 - 2c_2 x_1}{2(c_2 + k)} + RND(\epsilon)$ 
$x_2 \leftarrow x_1 + a_1$ 
$V'_2 \leftarrow -x_2^2$ 
$G'_2 \leftarrow -2x_2$ 
$V_2 \leftarrow -c_2 x_2^2 + w_3 x_2 + w_4$ 
$G_2 \leftarrow -2c_2 x_2 + w_3$ 
$\Omega_1 \leftarrow \frac{1}{2(c_2 + k)}$ 
$V'_1 \leftarrow -k a_1^2 + \lambda V'_2 + (1 - \lambda) V_2$ 
$G'_1 \leftarrow \frac{2 c_2 k a_1 + k(\lambda G'_2 + (1 - \lambda) G_2)}{c_2 + k}$ 
$V_1 \leftarrow -c_1 x_1^2 + w_1 x_1 + w_2$ 
$G_1 \leftarrow -2c_1 x_1 + w_1$ 
$\Omega_0 \leftarrow \frac{1}{2(c_1 + k)}$ 

Optimal Weights: $w_1^* = w_3^* = 0$ 

Table 4: Functions and Trajectory Variables for Experiment 2. 

The optimal value function, denoted by $V^*$, for this experiment is 

$$V^*(x_t) = \begin{cases} -\frac{1}{2} x_1^2 & \text{if } t = 1 \\ -x_2^2 & \text{if } t = 2 \end{cases}$$

For this reason most experiments were done with $c_1 = \frac{1}{2}$ and $c_2 = 1$. However the only 
necessity is to have $c_1 > 0$ and $c_2 > 0$, since these are required to make the greedy policy 
produce continuous actions; a problematic issue for all value function architectures. 
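The optimal value function for this experiment can be checked by a short backward calculation. This is a sketch under the assumption, consistent with the $Q$ functions of Table [4], that the two-step Toy Problem gives a reward of $-k a_t^2$ at each step and a final reward of $-x_2^2$:

```latex
\begin{align*}
V^*(x_2) &= -x_2^2 \\
V^*(x_1) &= \max_{a_1} \left[ -k a_1^2 - (x_1 + a_1)^2 \right] \\
0 &= -2 k a_1 - 2 (x_1 + a_1) \;\Rightarrow\; a_1 = -\frac{x_1}{k + 1} \\
V^*(x_1) &= -\frac{k x_1^2}{(k+1)^2} - \frac{k^2 x_1^2}{(k+1)^2} = -\frac{k}{k+1} x_1^2
\end{align*}
```

With $k = 1$ this gives $V^*(x_1) = -\frac{1}{2} x_1^2$, so a function approximator with $c_1 = \frac{1}{2}$ and $c_2 = 1$ can represent $V^*$ exactly when its remaining weights are zero.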

4.3 Experiment 3: Divergence of Algorithms With Two-step Toy Problem 

We now study the Toy Problem to try to find a set of parameters that cause learning to 
become unstable. Surprisingly, the two-step Toy Problem is sufficiently complex to provide 
examples of divergence, both with and without bootstrapping. By the principles argued 
in Section [1.4], we would expect these examples that were found for VGL to also cause 
divergence with the corresponding VL methods. This is confirmed empirically. 

| Weight Update Algorithm ($\lambda$) | $\epsilon$ | $c_1$ | $c_2$ | Success rate ($\alpha = 0.01$) | Iterations mean | Iterations s.d. | Success rate ($\alpha = 0.1$) | Iterations mean | Iterations s.d. |
|---|---|---|---|---|---|---|---|---|---|
| VL(1) | 1 | 0.5 | 1 | 100.0% | 244122 | 252234 | 91.3% | 736030 | 743920 |
| VL(1) | 0.1 | 0.5 | 1 | 100.0% | 135588 | 17360 | 100.0% | 21406.6 | 8641 |
| VGL(1) | 0 | 0.5 | 1 | 100.0% | 1596.8 | 72.58 | 100.0% | 152.5 | 13.79 |
| VGL$\Omega$(1) | 0 | 0.5 | 1 | 100.0% | 6089.1 | 340.09 | 100.0% | 600.7 | 47.06 |
| VGLRG(1) | 0 | 0.5 | 1 | 100.0% | 794.6 | 40.30 | 100.0% | 72.2 | 5.36 |
| VL(0) | 1 | 0.5 | 1 | 100.0% | 244368 | 252114 | 91.6% | 734029 | 742977 |
| VL(0) | 0.1 | 0.5 | 1 | 100.0% | 138073 | 17630 | 99.9% | 21918 | 8664 |
| VGL(0) | 0 | 0.5 | 1 | 100.0% | 1743.7 | 103.41 | 100.0% | 166.2 | 12.81 |
| VGL$\Omega$(0) | 0 | 0.5 | 1 | 100.0% | 6516.5 | 375.99 | 100.0% | 643.0 | 39.12 |
| VGLRG(0) | 0 | 0.5 | 1 | 100.0% | 1252.4 | 92.81 | 100.0% | 118.2 | 11.63 |
| VL(1) | 0.1 | 4 | 1 | 100.0% | 228336 | 60829 | 100.0% | 78364 | 62085 |
| VGL(1) | 0 | 4 | 1 | 100.0% | 5034.7 | 340.6 | 100.0% | 495.1 | 35.7 |
| VL(1) | 0.1 | 0.1 | 1 | 100.0% | 134443 | 16614 | 100.0% | 20974 | 8569 |
| VGL(1) | 0 | 0.1 | 1 | 100.0% | 1516.2 | 89.5 | 100.0% | 144.4 | 13.8 |

Table 5: Results for various algorithms on Experiment 2. 

If we take the previous experiment and consider the VGL$\Omega$($\lambda$) weight update then the 
only two weights that change are $w = (w_1, w_3)^T$. The weight update equation for these two 
weights can be found analytically by substituting all the equations of the right-hand side of 
Table [4] into the VGL$\Omega$($\lambda$) weight update equation, and using $\epsilon = 0$, giving: 

$$\Delta w = \alpha D E D w$$

with 

$$E = -2 \begin{pmatrix} k + \lambda (1 + b)(b(k + 1) + 1) - bk & \lambda (k + 1)(b + 1) - k \\ 1 + b(k + 1) & k + 1 \end{pmatrix},$$

$$b = \left(\frac{\partial \pi}{\partial x}\right)_1 = -\frac{c_2}{c_2 + k} \quad \text{and} \quad D = \begin{pmatrix} \frac{1}{2(k + c_1)} & 0 \\ 0 & \frac{1}{2(k + c_2)} \end{pmatrix}.$$




We can consider more types of function approximator by defining the weight vector $w$ 
to be a linear function of two new weights $p = (p_1, p_2)^T$ such that $w = Fp$, where $F$ is a 
$2 \times 2$ constant real matrix. If the VGL$\Omega$($\lambda$) weight update equation is now recalculated for 
these new weights then the dynamic system for $p$ is: 

$$\Delta p = \alpha (F^T D E D F) p \qquad (28)$$

Taking $\alpha$ to be sufficiently small, the weight vector $p$ evolves according to a 
continuous-time linear dynamic system, and this system is stable if and only if the matrix 
product $F^T D E D F$ is stable (i.e. if the real part of every eigenvalue of this matrix 
product is negative). 

The VGL($\lambda$) system weight update can also be derived and that system is identical to 
Eq. [28] but with the leftmost $D$ matrix omitted. 
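The stability condition on Eq. [28] is easy to test numerically for any candidate $F$. The sketch below is illustrative only: it assumes $D = \mathrm{diag}\left(\frac{1}{2(k + c_1)}, \frac{1}{2(k + c_2)}\right)$ and $E$ as obtained by substituting the Table [4] equations into the weight update with $\epsilon = 0$ and $x_0 = 0$, and the matrices `F` used in the usage lines are hypothetical choices, not the ones from the paper's divergence examples.

```python
import numpy as np

def build_D(c1, c2, k):
    """Diagonal matrix of the Omega factors for the two time steps."""
    return np.diag([1.0 / (2.0 * (k + c1)), 1.0 / (2.0 * (k + c2))])

def build_E(c2, k, lam):
    """Matrix E relating (G' - G) to D w for the two-step Toy Problem."""
    b = -c2 / (c2 + k)  # derivative of the greedy policy with respect to x at step 1
    return -2.0 * np.array([
        [k + lam * (1 + b) * (b * (k + 1) + 1) - b * k, lam * (k + 1) * (b + 1) - k],
        [1 + b * (k + 1), k + 1],
    ])

def system_matrix(F, c1, c2, k, lam, use_omega=True):
    """F^T D E D F for VGL-Omega, or F^T E D F for plain VGL."""
    D = build_D(c1, c2, k)
    E = build_E(c2, k, lam)
    left = D if use_omega else np.eye(2)
    return F.T @ left @ E @ D @ F

def is_stable(M):
    """True if every eigenvalue of M has a negative real part."""
    return bool(np.all(np.linalg.eigvals(M).real < 0))

# Hypothetical usage: identity F with the Experiment 2 constants.
F_identity = np.eye(2)
for lam in (0.0, 1.0):
    for use_omega in (True, False):
        M = system_matrix(F_identity, c1=0.5, c2=1.0, k=1.0, lam=lam, use_omega=use_omega)
        print(lam, use_omega, is_stable(M))

# With lambda = 1 the matrix E is symmetric, so F^T D E D F is symmetric too.
F_other = np.array([[1.0, 2.0], [3.0, 4.0]])
print(is_stable(system_matrix(F_other, c1=0.01, c2=0.01, k=0.01, lam=1.0)))
```

For $\lambda = 1$ the matrix $E$ is symmetric and negative definite, so $F^T D E D F$ is stable for any non-singular $F$; for other settings the eigenvalue check has to be run case by case.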

Choosing A = 0, with c\ = c 2 = k = 0.01 and F = D^ 1 f ^ ^ ^ leads to divergence 

for both the VGL($\lambda$) and VGL$\Omega$($\lambda$) systems. Empirically, we found that these parameters 
cause the VL(0) algorithm to diverge too. This is a specific counterexample for the VL 
system which is "on-policy" and equivalent to Sarsa. Previous examples of divergence for a 
function approximator with bootstrapping have usually been for "off-policy" learning (see, 
for example, Baird, 1995; Tsitsiklis and Roy, 1996b). Tsitsiklis and Roy (1996a) describe 
an "on-policy" counterexample for a non-linear function approximator, but this is not for 
the greedy policy. 

| Algorithm ($\lambda$) | Divergence example found | Proven to converge |
|---|---|---|
| VL(1) | Yes | No |
| VL(0) | Yes | No |
| VGL(1) | Yes | No |
| VGL(0) | Yes | No |
| VGL$\Omega$(1) | No | Yes |
| VGL$\Omega$(0) | Yes | No |

Table 6: Results for Experiment 3: Which algorithms can be made to diverge? 


Also, perhaps surprisingly, it is possible to get instability with $\lambda = 1$ with the VGL($\lambda$) 
system. Substituting c 2 = k = 0.01, c\ — n °° 1 L ' n_J 



0.99 and F = D 



makes the 



10 1 

VGL(1) system diverge. This result has been empirically verified to carry over to the VL(1) 
system too, i.e. this is a result where Sarsa(1) and TD(1) diverge. This highlights the 
difficulty of control problems in comparison to prediction tasks. A prediction task is easier, 
since as Sutton showed, the $\lambda = 1$ system is equivalent to gradient descent on the 
sum-of-squares error $E = \sum_t (V'_t - V_t)^2$, and so convergence is guaranteed for a prediction 
task. However in a control problem, even when there is no bootstrapping, changing one 
value of $V'_t$ affects the others by altering the greedy actions. This problem is resolved by 
using the VGL$\Omega$(1) weight update. 

The results of this section are summarised in Table [6]. 



4.4 Experiment 4: Two-step Toy Problem, with Insufficiently Flexible 
Function Approximator 

In this experiment the two-step Toy Problem with $k = 2$ was considered from a fixed start 
point of $x_0 = 0$. A function approximator with just one weight component, $(w_1)$, was 
defined differently at each time step: 

$$V(x_t, w_1) = \begin{cases} -c_1 x_1^2 + w_1 x_1 & \text{if } t = 1 \\ -c_2 x_2^2 + (w_1 - c_3) x_2 & \text{if } t = 2 \\ 0 & \text{if } t = 3 \end{cases}$$

Here $c_1 = 2$, $c_2 = 0.1$ and $c_3 = 10$ are constants. These were designed to create some conflict 
for the function approximator's requirements at each time step. The optimal actions are 
$a_0 = a_1 = 0$, and therefore the function approximator would only be optimal if it could 
achieve $\frac{\partial V(x, w)}{\partial x} = 0$ at $x = 0$ for both time steps. The presence of the $c_3$ term makes this 
impossible, so a compromise must be made. 

Sequential equations: 

$x_0 \leftarrow 0$ 
$a_0 \leftarrow \frac{w_1 - 2c_1 x_0}{2(c_1 + k)}$ 
$x_1 \leftarrow x_0 + a_0$ 
$a_1 \leftarrow \frac{w_1 - c_3 - 2c_2 x_1}{2(c_2 + k)}$ 
$x_2 \leftarrow x_1 + a_1$ 
$G'_2 \leftarrow -2x_2$ 
$G_2 \leftarrow -2c_2 x_2 + w_1 - c_3$ 
$\Omega_1 \leftarrow \frac{1}{2(c_2 + k)}$ 
$G'_1 \leftarrow \frac{2 c_2 k a_1 + k(\lambda G'_2 + (1 - \lambda) G_2)}{c_2 + k}$ 
$G_1 \leftarrow -2c_1 x_1 + w_1$ 
$\Omega_0 \leftarrow \frac{1}{2(c_1 + k)}$ 

Table 7: Trajectory Variables for Experiment 4. 



Only the value-gradient algorithms were considered in this experiment; hence there was 
no need for exploration or use of the $\epsilon$-greedy policy. The equations for the trajectory are 
very similar to those of Experiment 2, and so only the key results are listed in Table [7]. A different 
stopping criterion was used in this experiment, since each algorithm converges to a different 
fixed point. The stopping condition used was $|\Delta w_1| < 10^{-7} \alpha$. Once the fixed point had 
been reached we noted the value of $R$, the total reward for that trajectory, in Table [8]. Each 
algorithm experiment used $\alpha = 0.01$, attained 100% convergence, and produced the same 
value of $R$ each run. In these trials, $\alpha$ was not varied, and the iteration result columns are 
not very meaningful, since it can be shown for each algorithm that $\Delta w_1$ is a linear function 
of $w_1$. This indicates that any of the algorithms could be made arbitrarily fast by fine 
tuning $\alpha$, and also confirms that there is only one fixed point for each algorithm. 

The different values of $R$ in the results show that the different algorithms have varying 
degrees of optimality with respect to $R$. The first algorithm (VGL$\Omega$(1)) is the only one 
that is really optimal with respect to $R$ (subject to the constraints imposed by the awkward 
choice of function approximator), since it is equivalent to gradient ascent on $R^\pi$, as shown 
in Section [2.1]. 

It is interesting that the other algorithms converge to different suboptimal points, compared 
to VGL$\Omega$(1). This shows that introducing the $\Omega_t$ terms and using $\lambda = 1$ balances 
the priorities between minimising $(G'_1 - G_1)^2$ and minimising $(G'_2 - G_2)^2$ appropriately so 
as to finish with an optimal value for $R$. Different values for the constants $c_1$ and $c_2$ were 
chosen to emphasise this point. The relative rankings of the other algorithms may change 
in other problem domains, but it is expected that VGL$\Omega$(1) would always be at the top. 
Hence we use this algorithm as our standard choice of algorithm out of those listed in this 
paper; to be used in conjunction with a robust optimiser, e.g. RPROP. 
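Because $\Delta w_1$ is linear in $w_1$, the fixed point of each algorithm in this experiment can be found without iterating. The following sketch evaluates the sequential equations of Table [7] and solves for the fixed point of the VGL($\lambda$) and VGL$\Omega$($\lambda$) updates. It rests on two assumptions that are not spelled out in this section: that the weight update has the form $\Delta w_1 \propto \sum_t \Omega_{t-1} (G'_t - G_t)$ (with $\Omega = 1$ for plain VGL, and using $\partial G_t / \partial w_1 = 1$), and that the total reward is $R = -k a_0^2 - k a_1^2 - x_2^2$. The residual-gradient variants are not covered.

```python
def trajectory(w1, k=2.0, c1=2.0, c2=0.1, c3=10.0, x0=0.0):
    """Greedy trajectory of Experiment 4 for a given weight w1."""
    a0 = (w1 - 2 * c1 * x0) / (2 * (c1 + k))
    x1 = x0 + a0
    a1 = (w1 - c3 - 2 * c2 * x1) / (2 * (c2 + k))
    x2 = x1 + a1
    return a0, x1, a1, x2

def weight_update(w1, lam, use_omega, k=2.0, c1=2.0, c2=0.1, c3=10.0):
    """Delta w1 with alpha = 1 for VGL(lambda) or VGL-Omega(lambda)."""
    a0, x1, a1, x2 = trajectory(w1, k, c1, c2, c3)
    g2_target = -2 * x2
    g2 = -2 * c2 * x2 + w1 - c3
    g1_target = (2 * c2 * k * a1 + k * (lam * g2_target + (1 - lam) * g2)) / (c2 + k)
    g1 = -2 * c1 * x1 + w1
    omega0 = 1 / (2 * (c1 + k)) if use_omega else 1.0
    omega1 = 1 / (2 * (c2 + k)) if use_omega else 1.0
    # dG1/dw1 = dG2/dw1 = 1 for this function approximator
    return omega0 * (g1_target - g1) + omega1 * (g2_target - g2)

def fixed_point(lam, use_omega):
    """Root of the (linear) update, located from two evaluations."""
    d0 = weight_update(0.0, lam, use_omega)
    d1 = weight_update(1.0, lam, use_omega)
    return -d0 / (d1 - d0)

def total_reward(w1, k=2.0):
    """Assumed total reward R for the greedy trajectory generated by w1."""
    a0, _, a1, x2 = trajectory(w1, k)
    return -k * a0**2 - k * a1**2 - x2**2

for name, lam, use_omega in [("VGL-Omega(1)", 1.0, True), ("VGL(1)", 1.0, False),
                             ("VGL-Omega(0)", 0.0, True), ("VGL(0)", 0.0, False)]:
    w_star = fixed_point(lam, use_omega)
    print(f"{name}: w1* = {w_star:.5f}, R = {total_reward(w_star):.5f}")
```

Under these assumptions the four values of $R$ agree with the corresponding rows of Table [8], and the VGL$\Omega$(1) fixed point coincides with the stationary point of $R$ with respect to $w_1$.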



4.5 Experiment 5: One-Dimensional Lunar-Lander Problem 

| Weight Update Algorithm ($\lambda$) | $R$ | Iterations (Mean) | Iterations (s.d.) |
|---|---|---|---|
| VGL$\Omega$(1) | -2.65816 | 2327 | 247 |
| VGLRG(1) | -2.68083 | 183 | 19 |
| VGL(1) | -2.79905 | 500 | 51 |
| VGL$\Omega$(0) | -2.82344 | 3532 | 352 |
| VGLRG(0) | -3.97316 | 256 | 21 |
| VGL(0) | -5.76701 | 1077 | 87 |

Table 8: Results for Experiment 4, ranked by $R$. 

The objective of this experiment was to learn the "Lunar-Lander" problem described in 
Appendix [E]. The value function was provided by a fully connected multi-layer perceptron 
(MLP) (see Bishop (1995) for details). The MLP had 3 inputs, one hidden layer of 6 units, 
and one output in the final layer. Additional shortcut connections connected all input units 
directly to the output layer. The activation functions were standard sigmoid functions in the 
input and hidden layers, and an identity function (i.e. linear activation) in the output unit. 
The input to the neural network was $(h/100, v/10, u/50)^T$, and the output was multiplied by 
100 to give the value function. Each weight in the neural network was initially randomised 
to lie in $[-1, 1]$, with uniform probability distribution. 
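The network described above can be sketched as a forward pass. This is an illustrative reading rather than the author's implementation: the presence of bias terms, the parameter names, and the literal application of a sigmoid to the scaled inputs are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_weights(rng, n_in=3, n_hidden=6):
    """All weights drawn uniformly from [-1, 1]; bias terms are an assumption."""
    return {
        "W_hidden": rng.uniform(-1.0, 1.0, size=(n_hidden, n_in)),
        "b_hidden": rng.uniform(-1.0, 1.0, size=n_hidden),
        "w_out": rng.uniform(-1.0, 1.0, size=n_hidden),
        "w_shortcut": rng.uniform(-1.0, 1.0, size=n_in),  # input-to-output shortcuts
        "b_out": rng.uniform(-1.0, 1.0),
    }

def value(h, v, u, params):
    """Approximate value of the state (height h, velocity v, fuel u)."""
    x = np.array([h / 100.0, v / 10.0, u / 50.0])  # input scaling from the text
    x_act = sigmoid(x)                              # sigmoid input units, read literally
    hidden = sigmoid(params["W_hidden"] @ x_act + params["b_hidden"])
    out = params["w_out"] @ hidden + params["w_shortcut"] @ x_act + params["b_out"]
    return 100.0 * out                              # linear output, scaled by 100

rng = np.random.default_rng(0)
params = init_weights(rng)
print(value(100.0, -10.0, 50.0, params))
```

The value-gradient $G = \partial V / \partial x$ needed by the VGL algorithms would be obtained by differentiating `value` with respect to the state, for example by back-propagation through the same network.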

In this section results are presented as pairs of diagrams. The left-hand diagrams show 
a graph of the total reward for all trajectories versus the number of training iterations, and 
compare performance to that of an optimal policy. The optimal policy's performance was 
calculated by the theory described in Appendix [E.2]. That appendix also shows example 
optimal trajectories. The right-hand diagrams show a cross-section through state space, 
with the y-axis showing height and the x-axis showing velocity. The final trajectories 
obtained are shown as curves starting at a diamond symbol and finishing at $h = 0$. 

All algorithms used in this section were the continuous-time counterparts to those stated, 
and are described in Section [2.2]. Also, all weight updates were combined with RPROP for 
acceleration. For implementing RPROP, the weight update for all trajectories was first 
accumulated, and then the resulting weight update was fed into RPROP at the end of each 
iteration. RPROP was used with the default parameters defined by Riedmiller and Braun 
(1993). 
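The batch use of RPROP described above (accumulate the update over all trajectories, then pass the total to RPROP once per iteration) can be sketched as follows. This is a simplified sign-based variant without weight backtracking, written for gradient ascent; the class name, the toy objective and the parameter values (increase factor 1.2, decrease factor 0.5, initial step 0.1, step bounds $10^{-6}$ and 50, which are the values commonly quoted for RPROP) are assumptions, since the text only refers to the default parameters.

```python
import numpy as np

class Rprop:
    """Simplified RPROP (no weight backtracking) for gradient ascent."""

    def __init__(self, n, step_init=0.1, step_min=1e-6, step_max=50.0,
                 eta_plus=1.2, eta_minus=0.5):
        self.step = np.full(n, step_init)
        self.prev_grad = np.zeros(n)
        self.step_min, self.step_max = step_min, step_max
        self.eta_plus, self.eta_minus = eta_plus, eta_minus

    def update(self, w, grad):
        """Return new weights given the accumulated update direction grad."""
        agreement = grad * self.prev_grad
        grown = np.minimum(self.step * self.eta_plus, self.step_max)
        shrunk = np.maximum(self.step * self.eta_minus, self.step_min)
        # Grow the step where the sign is unchanged, shrink it where it flipped.
        self.step = np.where(agreement > 0, grown,
                             np.where(agreement < 0, shrunk, self.step))
        self.prev_grad = grad.copy()
        return w + np.sign(grad) * self.step

def accumulate(per_trajectory_updates):
    """Sum the weight updates from all trajectories of one iteration."""
    return np.sum(per_trajectory_updates, axis=0)

# Toy usage: maximise -(w - target)^2, with the gradient split over two "trajectories".
target = np.array([3.0, -1.0])
w = np.zeros(2)
optimiser = Rprop(n=2)
for _ in range(200):
    grad = -2.0 * (w - target)
    w = optimiser.update(w, accumulate([0.5 * grad, 0.5 * grad]))
print(w)
```

Only the sign of each accumulated component is used, which is why the method copes with very small gradients.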

Results for the task of learning one trajectory from a fixed start point are shown in Figure 
[6] for the VGL$\Omega$ algorithm. The results were averaged over 10 runs. VGL$\Omega$ worked better on 
this task than VGL. VGL sometimes produced unstable or suboptimal solutions. However, 
both could manage the task well with $c = 1$. It was not possible to get VL to work on this 
task at all. VL failed with a greedy policy, as expected, as there is no exploration. However 
VL also failed on this task when using the $\epsilon$-greedy policy. We suspect the reason for this 
was that random exploration was producing more unwanted noise than useful information; 
random exploration makes the spacecraft fly the wrong way and learn the wrong values. 
We believe this makes a strong case for using value-gradients and a deterministic system 
instead of VL with random exploration. 

The kink in the left diagram of Figure [6] was caused because c was low, so the gradient 
QJ^S- was tiny for the first few iterations. It seemed that RPROP would cause the weight 
change to build up momentum quickly and then temporarily overshoot its target before 
bringing it back under control. It was very difficult to learn this task with such a small c 
value without RPROP. 

Figure 6: Learning performance for a single trajectory on the Lunar-Lander problem. 
Parameters: $c = 0.01$, $\Delta t = 0.1$, $\lambda = 0$. The algorithm used was VGL$\Omega$. The 
trajectory produced is very close to optimal (c.f. Fig. [9]). 

Figure [7] shows the performance of VGL and VL in a task of learning 50 trajectories from 
fixed start points. The VGL learning graph is clearly smoother, more efficient and more 
optimal than the VL graph. VGL$\Omega$ and VGL could both manage this task, achieving close 
to optimal performance each time, but only VGL$\Omega$ could cope with the smaller $c$ values in 
the range $[0.01, 1)$. 




Figure 7: Learning performance for learning 50 trajectories simultaneously on the 
Lunar-Lander problem. Parameters: $c = 1$, $\Delta t = 1$, $\lambda = 0$. The algorithms used 
were VL and VGL. Graphs were averaged over 20 runs. The averaging will have 
had a smoothing effect on both curves. The right graph shows a set of final 
trajectories obtained with one of the VGL trials, which are close to optimal. 



It was difficult to get VL working well on this problem at all, and the parameters chosen 
were those that appeared to be most favourable to VL in preliminary experiments. Having 
such a high $c$ value makes the task much easier, but VL still could not get close to optimal, 
or stable, trajectories. No stochastic elements were used in either the policy or model. 
The large number of fixed trajectory start points provided the exploration required by VL. 
These gave a reasonable sampling of the whole of state space. Preliminary tests with the 
$\epsilon$-greedy policy did not produce any successful learning. The policy used was the greedy 
policy described in Appendix [E]. 

We could not get bootstrapping to work on this task for either the VL or VGL 
algorithms. With bootstrapping, the trajectories tended to oscillate continually. It was not 
clear why this was, but it is consistent with the lack of convergence results for bootstrapping. 

It is desirable to have $c$ as small as possible to get fuel-efficient trajectories (see for 
example Figure [9]), but a small $c$ makes the continuous-time DE more stiff. However our 
experience was that the limiting factor in choosing a small c was not the stiffness of the 
DE, but the fact that as c — > 0, — > which made learning with VGLO difficult. Using 
RPROP largely remedied this since it copes well with small gradients and can respond 
quickly when the gradient suddenly changes. We did not need to use a stiff DE solver, and 
found the Euler method adequate to integrate the equations of Section [2.2]. 

5. Conclusions 

This section summarises some of the issues and benefits raised by the VGL approach. Also, 
the contributions of this paper are highlighted. 

• Several VGL algorithms have been stated. These are algorithms VGL($\lambda$), VGL$\Omega$($\lambda$) 
and VGLRG($\lambda$) (see Table [1]), with their continuous-time counterparts, and actor-critic 
learning; all defined for any $0 \le \lambda \le 1$. Results on the Toy Problem and the 
Lunar-Lander are better than the VL results by several orders of magnitude, in all 
cases. 

• The value-gradient analysis goes a long way towards resolving the issue of exploration in 
value function learning. Local exploration comes for free with VGL. Other than the 
problem of local versus global optimality, the problems of exploration are resolved, 
and value function learning is put onto an equal footing with PGL. For example, as 
discussed in Section [1.5], exploration had previously caused difficulties in Q($\lambda$)-learning 
and Sarsa($\lambda$). 

• In Appendix A, definitions of extremal and optimal trajectories and an optimality 
proof are given for learning by value-gradients. The proof refers to Pontryagin's 
Maximal Principle (PMP), but in the case of "bang-bang" control, the conclusion of 
the proof goes slightly further than is implied solely by PMP. 

• The value-gradient analysis provides an overall view that links several different areas 
within reinforcement learning. For example, the connection between PGL and value 
function learning is proven in Section [2.1]. This provides an explanation of what 
happens when the "residual gradient terms" are omitted from the weight update 
equations (i.e. — ► see Section [2TT]) . and a tentative justification for the TD(A) 
weight update equation (see Section |2"7I|) . Also, the obvious similarity of form between 
Eq. [15] and Eq. [27] provides connections between PGL and actor-critic architectures, 
as discussed in Section [3] 

• The use of a function approximator has been intrinsic throughout. This followed from 
the definition of $V$ in Section [1.1]. This has led to a robust convergence proof, for a 
general function approximator and value function, in Section [2.1]. The use of a general 
function approximator is an advancement over simplified function approximators 
(e.g. linear approximators), or function approximators that require a hand-picked 
partitioning of state space. 

• Most previous studies have separated the value function update from the greedy policy 
update, but this has been a severe limitation because RL does not work this way; in 
practice, it is necessary to alternate one with the other, otherwise the RL system 
would never improve. However no previous convergence proof or divergence example 
applies to this "whole system". In this paper there has been a tight coupling of 
the value function to the greedy policy, and this has made it possible to successfully 
analyse what effect a value function update has on the greedy policy. We have found 
convergence proofs and divergence examples for the whole system. We do not think 
it is possible to do this analysis without value-gradients, since the expression for $\frac{\partial \pi}{\partial w}$ 
in Eq. [17] depends on $\frac{\partial G}{\partial w}$. Hence value-gradients are necessary to understand 
value-function learning, whether by VL or VGL. 

• By considering the "whole system", a divergence example is found for VL (including 
Sarsa($\lambda$) and TD($\lambda$)) and VGL with $\lambda = 1$, in Section [4.3]. This may be a surprising 
result, since it is generally thought that the case of $\lambda = 1$ is fully understood and 
always converges, but this is not so for the whole system on a control problem. 

• It is proposed in Section [1.4] that the value-gradients approach is an idealised form 
of VL. We also believe that the approach of VL is not only haphazardly indirect, 
but also introduces some extra unwanted terms into the weight update equation, as 
demonstrated in the analysis at the end of Section [4.1]. 



• The continuous-time policy stated in Section O (taken from Doya, 2000) is an exact 
implementation of the greedy policy $\pi(x, w)$ that is smooth with respect to $x$ and $w$ 
provided that the function approximator used is smooth. This resolves one of the 
greatest difficulties of value-function learning, namely that of discontinuous changes 
to the actions chosen by the greedy policy. 

• A new explanation is given about how residual gradient algorithms can get trapped in 
spurious local minima, as described in Section [2.3]. We think this is the main reason 
why residual gradients often fail to work in practice, even in the case of VGL and 
deterministic systems. Understanding this will hopefully save other researchers from losing 
too much time exploring this possibility. 

It is the opinion of the author that there are several problematic issues about reinforcement 
learning with an approximated value function, which the algorithm VGL$\Omega$(1) resolves. 
These problematic issues are: 

• Learning progress is far from monotonic, when measured by either $E$ (Eq. [12]) or 
$R^\pi$, or any other metric currently known. This problem is resolved by the proposed 
algorithm, when used in conjunction with a policy such as the one in Section! 
• When learning a value function with a function approximator, the objective is to 
obtain $G_t = G'_t$ for all $t$ along a trajectory. In general, due to the nature of function 
approximation, this will never be attained exactly. However, even if this is very close 
to being attained, the value of $R^\pi$ may be far from optimal. In short, minimising $E$ 
and maximising $R^\pi$ are not the same thing unless $E = 0$ can be attained exactly. As 
demonstrated in the experiment of Section [4.4] and the proofs in Section [2.1], this 
problem is resolved by the proposed algorithm. 

• The success of learning can depend on how state space is scaled. The definition of $\Omega_t$ 
(Eq. [19]) resolves this problem. Other algorithms can become unstable without this. 

• Making successful experiments reproducible in VL is very difficult. There are no 
convergence guarantees, either with or without bootstrapping, and success often depends 
upon well-chosen parameter choices made by the experimenter. For example, the 
Lunar-Lander problem in Section [4.5] seems to defeat VL with the given choices of 
state space scaling and function approximator. With the proposed algorithm, convergence 
to some fixed point is assured; and so one major element of luck is removed. 

All of the proposed algorithms are defined for any $0 \le \lambda \le 1$. The results in Section 
[3.1] and Appendix [A] are valid proofs for any $\lambda$, but the main convergence result of this 
paper, Eq. [18], applies only to $\lambda = 1$. Unfortunately divergence examples exist for $\lambda < 1$, 
as described in Section [4.3]. 

Also by proving equivalence to policy-learning in the case of $\lambda = 1$, and finding a lack 
of robustness and divergence examples for $\lambda < 1$, the usefulness of the value function is 
somewhat discredited; both for VGL, and its stochastic relative, VL. 

Appendix A. Optimal Trajectories 

In this appendix we define locally optimal trajectories and prove that if $G'_t = G_t$ for all $t$ 
along a greedy trajectory then that trajectory is locally extremal, and in certain situations, 
locally optimal. 

Locally Optimal Trajectories. We define a trajectory parametrised by values $\{x_0, a_0, a_1, a_2, \ldots\}$ 
to be locally optimal if $R(x_0, a_0, a_1, a_2, \ldots)$ is at a local maximum with respect to the 
parameters $\{a_0, a_1, a_2, \ldots\}$, subject to the constraints (if present) that $-1 \le a_t \le 1$. 

Locally Extremal Trajectories (LET). We define a trajectory parametrised by values 
$\{x_0, a_0, a_1, a_2, \ldots\}$ to be locally extremal if, for all $t$, 




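$$\begin{cases} \left(\frac{\partial R}{\partial a}\right)_t = 0 & \text{if } a_t \text{ is not saturated} \\ \left(\frac{\partial R}{\partial a}\right)_t > 0 & \text{if } a_t \text{ is saturated and } a_t = 1 \\ \left(\frac{\partial R}{\partial a}\right)_t < 0 & \text{if } a_t \text{ is saturated and } a_t = -1 \end{cases} \qquad (29)$$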
In the case that all the actions are unbounded, this criterion for a LET simplifies to that of 
just requiring $\left(\frac{\partial R}{\partial a}\right)_t = 0$ for all $t$. Having the possibility of bounded actions introduces the 
extra complication of saturated actions. The second condition in Eq. [29] incorporates the 
idea that if a saturated action $a_t$ is fully "on", then we would normally like it to be on even 
more (if that were possible). In fact, in this definition $R$ is locally optimal with respect to 
any saturated actions. Consequently, if all of the actions are saturated (for example in the 
case of "bang-bang" control), then this definition of a LET provides a sufficient condition 
for a locally optimal trajectory. 

Concave Model Functions. We say a model has concave model functions if all locally 
extremal trajectories are guaranteed to be locally optimal. In other words, if we define $\nabla_a R$ 
to be a column vector with $i$th element equal to $\frac{\partial R}{\partial a_i}$, and $\nabla_a \nabla_a R$ to be the matrix with 
$(i, j)$th element equal to $\frac{\partial^2 R}{\partial a_i \partial a_j}$, then the model functions are concave if $\nabla_a R = 0$ implies 
$\nabla_a \nabla_a R$ is negative definite. 

For example, for the two-step Toy Problem with $k = 1$, since $R(x_0, a_0, a_1) = -a_0^2 - 
a_1^2 - (x_0 + a_0 + a_1)^2$, we have $\nabla_a \nabla_a R = \begin{pmatrix} -4 & -2 \\ -2 & -4 \end{pmatrix}$, which is constant and negative definite; 
and so the two-step Toy Problem with $k = 1$ has concave model functions. It can also be 
shown that the n-step Toy Problem with any $k > 0$, and any $n > 1$, also has concave model 
functions. 
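The negative definiteness claimed for the two-step Toy Problem can be confirmed numerically. The sketch below builds the Hessian of $R$ with respect to the actions by central finite differences and inspects its eigenvalues; the helper names and the generalisation of $R$ to other values of $k$ (as $-k a_0^2 - k a_1^2 - (x_0 + a_0 + a_1)^2$) are assumptions made for illustration.

```python
import numpy as np

def total_reward(a, x0=0.0, k=1.0):
    """Total reward of the two-step Toy Problem as a function of the actions."""
    a0, a1 = a
    return -k * a0**2 - k * a1**2 - (x0 + a0 + a1)**2

def hessian(f, a, h=1e-4):
    """Central finite-difference Hessian of f at the point a."""
    n = len(a)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.eye(n)[i] * h
            ej = np.eye(n)[j] * h
            H[i, j] = (f(a + ei + ej) - f(a + ei - ej)
                       - f(a - ei + ej) + f(a - ei - ej)) / (4 * h * h)
    return H

H = hessian(total_reward, np.zeros(2))
eigenvalues = np.linalg.eigvalsh(H)
print(H)            # approximately [[-4, -2], [-2, -4]]
print(eigenvalues)  # approximately [-6, -2], both negative
```

Because $R$ is quadratic in the actions, the Hessian is the same at every point, so a single evaluation is enough here.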

Constrained Locally Optimal Trajectories. The previous two optimality criteria were 
independent of any policy. This weaker definition of optimality is specific to a particular 
policy, and is defined as follows: A constrained (with respect to $w$) locally optimal trajectory 
is a trajectory parametrised by an arbitrary smooth policy function $\pi(x_t, w)$, where $w$ is the 
weight vector of some function approximator, such that $R^\pi(x_0, w)$ is at a local maximum 
with respect to $w$. 

If we assume the function $R^\pi(x, w)$ is continuous and smooth everywhere with respect 
to $w$, then this kind of optimality is naturally achieved at any stationary point found by 
gradient ascent on $R^\pi$ with respect to $w$, i.e. by any PGL algorithm. 

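Lemma 5 If $G'_t = G_t$ (for all $t$, and some $\lambda$) along a trajectory found by an arbitrary 
smooth policy $\pi(x, z)$, then $G'_t = G_t = \left(\frac{\partial R^\pi}{\partial x}\right)_t$ for all $t$. 

This lemma is for a general policy, and is required by Section [3.1]. Here we use the 
extended definitions of $V'$ and $G'$ that apply to any policy, given in Section [3] and Eq. [26]. 

First we note that $\left(\frac{\partial R^\pi}{\partial x}\right)_t = G'_t$ with $\lambda = 1$, since when $\lambda = 1$ any dependency of 
$G'(x_t, w, z)$ on $w$ disappears. Also, by Eq. [6] and since $G'_t = G_t$ we get, 

$$G'_t = \left( \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial r}{\partial a}\right)_t \right) + \left( \left(\frac{\partial f}{\partial x}\right)_t + \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial f}{\partial a}\right)_t \right) G'_{t+1}$$

Therefore $G'_t$ is independent of $\lambda$ and therefore $G'_t = G_t = \left(\frac{\partial R^\pi}{\partial x}\right)_t$ for all $t$. ■ 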

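Lemma 6 If $G'_t = G_t$ (for all $t$, and some $\lambda$) along a greedy trajectory then
$G'_t = G_t = \left(\frac{\partial R}{\partial x}\right)_t = \left(\frac{\partial R^\pi}{\partial x}\right)_t$ for all $t$.

This is proved by induction. Note that this lemma differs from the previous lemma in
that it is specifically for the greedy policy, and the conclusion is stronger. By Eq. 6 and
since $G'_t = G_t$ we get,

$$G_t = \left(\frac{\partial \pi}{\partial x}\right)_t \left(\frac{\partial Q}{\partial a}\right)_t + \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1}$$

The left term of this sum must be zero since the greedy policy implies either $\left(\frac{\partial \pi}{\partial x}\right)_t = 0$ (in
the case that $a_t$ is saturated and $\left(\frac{\partial \pi}{\partial x}\right)_t$ exists, by Lemma 2), or $\left(\frac{\partial Q}{\partial a}\right)_t = 0$ (in the case that
$a_t$ is not saturated, by Lemma 1). If $\left(\frac{\partial \pi}{\partial x}\right)_t$ does not exist then it must be that $\lambda = 0$, since
$G'_t$ exists, and when $\lambda = 0$ the definition is $G'_t = \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1}$. Therefore in all cases,

$$G_t = \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial f}{\partial x}\right)_t G_{t+1}$$

Also, differentiating Eq. 1 with respect to $x$ gives

$$\left(\frac{\partial R}{\partial x}\right)_t = \left(\frac{\partial r}{\partial x}\right)_t + \left(\frac{\partial f}{\partial x}\right)_t \left(\frac{\partial R}{\partial x}\right)_{t+1} \tag{30}$$

So $\left(\frac{\partial R}{\partial x}\right)_t$ and $G_t$ have the same recursive definition. Also their values at the final time
step $t = F$ are the same, since $\left(\frac{\partial R}{\partial x}\right)_F = G_F = 0$. Therefore, by induction and Lemma 5,
$G'_t = G_t = \left(\frac{\partial R}{\partial x}\right)_t = \left(\frac{\partial R^\pi}{\partial x}\right)_t$ for all $t$.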
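The recursion in Eq. 30 can be checked numerically. The following is a minimal Python sketch that compares the backward recursion for $\partial R/\partial x$ with a finite-difference estimate, for a fixed action sequence. The scalar model functions `f` and `r`, the action values, and all function names are assumptions chosen purely for illustration; they are not the model functions used elsewhere in this paper.

```python
import math

def f(x, a):
    # Assumed smooth scalar model function (illustrative only).
    return x + a + 0.1 * math.sin(x)

def r(x, a):
    # Assumed smooth reward function (illustrative only).
    return -a * a - 0.5 * x * x

def df_dx(x, a):
    return 1.0 + 0.1 * math.cos(x)

def dr_dx(x, a):
    return -x

def total_reward(x0, actions):
    """Total reward for a fixed action sequence, with R = 0 at the final step F."""
    x, total = x0, 0.0
    for a in actions:
        total += r(x, a)
        x = f(x, a)
    return total

def dR_dx_backward(x0, actions):
    """Backward recursion (dR/dx)_t = (dr/dx)_t + (df/dx)_t * (dR/dx)_{t+1}."""
    xs = [x0]
    for a in actions:
        xs.append(f(xs[-1], a))
    grad = 0.0  # (dR/dx)_F = 0 at the final time step
    for t in reversed(range(len(actions))):
        grad = dr_dx(xs[t], actions[t]) + df_dx(xs[t], actions[t]) * grad
    return grad

actions = [0.3, -0.2, 0.5]
x0 = 1.0
eps = 1e-6
# Central finite difference of the total reward with respect to x0.
fd = (total_reward(x0 + eps, actions) - total_reward(x0 - eps, actions)) / (2 * eps)
print(dR_dx_backward(x0, actions), fd)
```

The two printed numbers should agree up to finite-difference error, which is the sense in which $G_t$ and $\left(\frac{\partial R}{\partial x}\right)_t$ share one recursive definition.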

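Theorem 7 Any greedy trajectory satisfying $G'_t = G_t$ (for all $t$) must be locally extremal.

Proof: Since the greedy policy maximises $Q(x_t, a_t, w)$ with respect to $a_t$ at each time-step
$t$, we know at each $t$,

$$\begin{cases}
\left(\frac{\partial Q}{\partial a}\right)_t = 0 & \text{if } a_t \text{ is not saturated} \\
\left(\frac{\partial Q}{\partial a}\right)_t > 0 & \text{if } a_t \text{ is saturated and } a_t = 1 \\
\left(\frac{\partial Q}{\partial a}\right)_t < 0 & \text{if } a_t \text{ is saturated and } a_t = -1
\end{cases} \tag{31}$$

These follow from Lemma 1 and the definition of saturated actions. Additionally, by Lemma 6,
$G_t = \left(\frac{\partial R}{\partial x}\right)_t$ for all $t$. Therefore since,

$$\left(\frac{\partial R}{\partial a}\right)_t = \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t \left(\frac{\partial R}{\partial x}\right)_{t+1}$$

$$\left(\frac{\partial Q}{\partial a}\right)_t = \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1}$$

we have $\left(\frac{\partial R}{\partial a}\right)_t = \left(\frac{\partial Q}{\partial a}\right)_t$ for all $t$. Therefore the consequences of the greedy policy (Eq. 31)
become equivalent to the sufficient conditions for a LET (Eq. 29), which implies the
trajectory is a LET. ■

Corollary 8 If, in addition to the conditions of Theorem 7, the model functions are concave,
or if all of the actions are saturated (bang-bang control), then the trajectory is locally
optimal.

This follows from the definitions given above of concave model functions and a LET. ■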

Remark: In practice we often do not need to worry about the need for concave model
functions, since any algorithm that works by gradient ascent on $R^\pi$ will tend to head
towards local maxima, not saddle-points or minima. This applies to all VGL algorithms
listed in this paper, except for residual-gradients.

Remark: We point out that the proof of Theorem 7 could almost be replaced by use of
Pontryagin's maximum principle (PMP) (Bronshtein and Semendyayev, 1985), since Eq.
30 implies $\left(\frac{\partial R}{\partial x}\right)_t$ is the "costate" (or "adjoint") vector of PMP, and Lemma 6 implies that
the greedy policy is equivalent to the maximum condition of PMP. PMP on its own is not
sufficient for the optimality proof without use of Lemma 6. Use of PMP would obviate
the need for the bespoke definition of a LET that we have used. We did not use PMP
because it is only described to be a "necessary" condition for optimality, and the way we
have formulated the proof allows us to derive the corollary's extra conclusion for bang-bang
control.



Appendix B. Detailed counterexample of the failure of value-learning
without exploration, compared to the impossibility of failure for
value-gradient learning.

This section gives a more detailed example than that of Fig. 3, to show why exploration is
necessary to VL but not to VGL.

We consider the one-step Toy Problem with $k = 1$. For this problem, the optimal policy
(Eq. 8) simplifies to

$$\pi^*(x_0) = -x_0/2 \tag{32}$$

Next we define a value function on which a greedy policy can be defined. Let the value
function be linear, for simplicity, and be approximated by just two parameters $(w_1, w_2)$,
and defined separately for the two time steps.

$$V(x_t, w_1, w_2) = \begin{cases} w_1 + w_2 x_1 & \text{if } t = 1 \\ 0 & \text{if } t = 2 \end{cases} \tag{33}$$

For the final time step, $t = 2$, we have assumed the value function is perfectly known, so
that $V_2 = R_2 = 0$. At time step $t = 0$, it is not necessary to define the value function since
the greedy policy only looks ahead. Differentiating this value function gives the following
value-gradient function:

$$G(x_t, w_1, w_2) = \begin{cases} w_2 & \text{if } t = 1 \\ 0 & \text{if } t = 2 \end{cases}$$

The greedy policy on this value function gives

$$\begin{aligned}
a_0 &= \pi(x_0, w) \\
&= \arg\max_a \left( r(x_0, a) + V(f(x_0, a), w) \right) && \text{by the greedy policy definition and Eq. 2} \\
&= \arg\max_a \left( -a^2 + w_1 + w_2 (x_0 + a) \right) && \text{by the Toy Problem model functions and Eq. 33} \\
&= w_2/2
\end{aligned} \tag{34}$$
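The closed form $a_0 = w_2/2$ can be sanity-checked by brute force. The sketch below maximises the one-step Q function over a grid of candidate actions; the grid bounds, the grid resolution and the function names are illustrative assumptions, and the model is the one-step Toy Problem with $k = 1$ as used above.

```python
def q_value(a, x0, w1, w2):
    # Q(x0, a, w) = r(x0, a) + V(f(x0, a), w), with r = -a^2, f = x0 + a
    # and the linear value function V = w1 + w2 * x1.
    return -a * a + w1 + w2 * (x0 + a)

def greedy_action_grid(x0, w1, w2, lo=-10.0, hi=10.0, steps=20001):
    """Approximate argmax over a of Q by exhaustive search on a uniform grid."""
    candidates = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda a: q_value(a, x0, w1, w2))

x0, w1, w2 = 2.0, 0.7, 3.0
a_grid = greedy_action_grid(x0, w1, w2)
a_closed = w2 / 2.0
print(a_grid, a_closed)
```

Note that the maximiser does not depend on $x_0$ or $w_1$, which is why a learned $w_1$ alone cannot influence the greedy trajectory in this problem.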



Having defined a value-function and found the greedy policy that acts on it, we next 
analyse the situations in the VL and VGL cases, each without exploration. The value 
function defined above is used in the following examples. 

Note that the conclusions of the following examples cannot be explained by choice of 
function approximator for V. For example Fig. [3] shows a counterexample for a different 
function approximator, and similar counterexamples for VL can easily be found for any 
function approximator of a higher degree. A linear function approximator was chosen here 
since it is the simplest type of approximator that can be made to learn an optimal trajectory 
in this problem, as is illustrated in the VGL example below. 

Value-Learning applied to Toy Problem (without exploration): Here the aim is 
to show that VL, without exploration, can be applied to the one-step Toy Problem (with 
k = 1) and converge to a sub-optimal trajectory. 

The target for the value function at $t = 1$ is given by:

$$\begin{aligned}
V'_1 &= r(x_1, a_1) + \lambda V'_2 + (1 - \lambda) V_2 && \text{by Eq. 4} \\
&= -x_1^2 && \text{by the Toy Problem definition, and since } V'_2 = V_2 = 0
\end{aligned}$$

The value function at $t = 1$ is given by:

$$V_1 = w_1 + w_2 x_1 \qquad \text{by Eq. 33}$$

A simple counterexample can be chosen to show that if VL is complete (i.e. if $V_t = V'_t$
for all $t > 0$), then the trajectory may not be optimal. If $x_0 = 5$, $w_1 = -25$, $w_2 = 0$
then the greedy policy (Eq. 34) gives $a_0 = w_2/2 = 0$ and thus $x_1 = x_0 = 5$. Therefore
$V_1 = V'_1 = -25$, and $V_2 = V'_2 = 0$, and so learning is complete. However the trajectory is
not optimal, since the optimal policy (Eq. 32) requires $a_0 = -5/2$. ■

Value-Gradient Learning applied to Toy Problem: The objective of VGL is to make
the value-gradients match their target gradients. For the one-step Toy Problem (with
$k = 1$), we get:

$$\begin{aligned}
G'_1 &= \left(\frac{Dr}{Dx}\right)_1 + \left(\frac{Df}{Dx}\right)_1 \left(\lambda G'_2 + (1 - \lambda) G_2\right) && \text{by Eq. 6} \\
&= \left(\frac{Dr}{Dx}\right)_1 && \text{since } G'_2 = G_2 = 0 \\
&= -2x_1 && \text{by the Toy Problem definition}
\end{aligned}$$

The value-gradient at $t = 1$ is given by $G_1 = w_2$.

For these to be equal, i.e. for $G_t = G'_t$ for all $t > 0$, we must have $w_2 = -2x_1$. The
greedy policy (Eq. 34) then gives $a_0 = w_2/2 = -x_1 = -(x_0 + a_0) \Rightarrow a_0 = -x_0/2$, which is
the same as the optimal policy (Eq. 32). Therefore if the value-gradients are learned, then
the trajectory will be optimal. ■
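The contrast between the two fixed points can be reproduced in a few lines. The following sketch assumes the one-step Toy Problem with $k = 1$, i.e. total reward $-a_0^2 - x_1^2$ with $x_1 = x_0 + a_0$; the function and variable names are illustrative only.

```python
def rollout(x0, w2):
    """Greedy rollout of the one-step Toy Problem (k = 1), using a0 = w2 / 2."""
    a0 = w2 / 2.0
    x1 = x0 + a0
    total_reward = -a0 ** 2 - x1 ** 2
    return a0, x1, total_reward

x0 = 5.0

# Value-learning fixed point from the counterexample: w1 = -25, w2 = 0.
w1, w2 = -25.0, 0.0
a0_vl, x1_vl, R_vl = rollout(x0, w2)
V1 = w1 + w2 * x1_vl        # learned value at t = 1
V1_target = -x1_vl ** 2     # target value V'_1
vl_complete = (V1 == V1_target)

# Value-gradient fixed point: G_1 = w2 must equal G'_1 = -2 * x1, and
# x1 = x0 + w2 / 2 under the greedy policy, so w2 = -x0.
w2_vgl = -x0
a0_vgl, x1_vgl, R_vgl = rollout(x0, w2_vgl)

print("VL :", vl_complete, a0_vl, R_vl)
print("VGL:", w2_vgl == -2.0 * x1_vgl, a0_vgl, R_vgl)
```

With $x_0 = 5$ the value-learning fixed point is self-consistent yet earns a total reward of $-25$, while the value-gradient fixed point recovers $a_0 = -x_0/2$ and a total reward of $-12.5$.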

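Appendix C. Equivalence of V' notation in TD(λ)

The formulation of TD(λ) as presented in this paper (Eq. 10) uses the $V'$ notation. This can
be proven to be equivalent to the formulation used by Sutton (1988) as follows. Expanding
the recursion in Eq. 4 gives $V'_t = \sum_{k \ge t} \lambda^{k-t} \left( r_k + (1 - \lambda) V_{k+1} \right)$, so Eq. 10 becomes:

$$\begin{aligned}
\Delta w &= \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial w}\right)_t \left( \sum_{k \ge t} \lambda^{k-t} \left( r_k + (1 - \lambda) V_{k+1} \right) - V_t \right) \\
&= \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial w}\right)_t \left( \sum_{k \ge t} \lambda^{k-t} (r_k + V_{k+1}) - \sum_{k \ge t} \lambda^{k-t} \lambda V_{k+1} - V_t \right) \\
&= \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial w}\right)_t \left( \sum_{k \ge t} \lambda^{k-t} (r_k + V_{k+1}) - \sum_{k \ge t} \lambda^{k-t} V_k \right) \\
&= \alpha \sum_{k \ge 1} (r_k + V_{k+1} - V_k) \sum_{t=1}^{k} \lambda^{k-t} \left(\frac{\partial V}{\partial w}\right)_t
\end{aligned}$$

This last line is a batch-update version of the weight update equation given by Sutton
(1988). This validates the use of the notation $V'$.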
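The equivalence can also be checked numerically on an arbitrary episode. The sketch below assumes a single scalar weight, a terminal value of zero, and made-up rewards, values and gradients; the function names are illustrative. The first function uses the recursive target $V'_t$, the second uses the TD error weighted by an accumulated trace of past gradients.

```python
def td_lambda_forward(rewards, values, grads, lam, alpha):
    """alpha * sum_t grad_t * (V'_t - V_t), with V'_t = r_t + lam*V'_{t+1} + (1-lam)*V_{t+1}."""
    target_next = 0.0  # V' at the terminal step, taken to be zero
    dw = 0.0
    for t in reversed(range(len(rewards))):
        target = rewards[t] + lam * target_next + (1.0 - lam) * values[t + 1]
        dw += grads[t] * (target - values[t])
        target_next = target
    return alpha * dw

def td_lambda_backward(rewards, values, grads, lam, alpha):
    """alpha * sum_k (r_k + V_{k+1} - V_k) * sum_{t<=k} lam^(k-t) * grad_t."""
    trace, dw = 0.0, 0.0
    for k in range(len(rewards)):
        trace = lam * trace + grads[k]  # running sum of discounted past gradients
        dw += (rewards[k] + values[k + 1] - values[k]) * trace
    return alpha * dw

rewards = [1.0, -0.5, 2.0, 0.3]
values = [0.4, 1.2, -0.7, 0.9, 0.0]   # one entry per state; the terminal value is 0
grads = [0.5, -1.0, 0.25, 2.0]        # dV/dw at each non-terminal step (scalar weight)
dw_forward = td_lambda_forward(rewards, values, grads, lam=0.7, alpha=0.1)
dw_backward = td_lambda_backward(rewards, values, grads, lam=0.7, alpha=0.1)
print(dw_forward, dw_backward)
```

The two results agree up to floating-point rounding for any $\lambda \in [0, 1]$, provided the terminal value is zero.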



Appendix D. Equivalence of Sarsa(λ) to TD(λ) for control problems with
a known model

Sarsa(λ) is an algorithm for control problems that learns to approximate the $Q(x, a, w)$
function (Rummery and Niranjan, 1994). It is designed for policies that are dependent on
the $Q(x, a, w)$ function (e.g. the greedy policy or ε-greedy policy).

The Sarsa(λ) algorithm is defined for trajectories where all actions after the first are
found by the given policy; the first action $a_0$ can be arbitrary. The function-approximator
update is then:

$$\Delta w = \alpha \sum_{t \ge 0} \left(\frac{\partial Q}{\partial w}\right)_t (Q'_t - Q_t) \tag{35}$$

where $Q'_t = r_t + V'_{t+1}$.

Sarsa(λ) is designed to be able to work in problem domains where the model functions
are not known; however, we can also apply it to the Q function as defined in Eq. 2 that relies
upon our known model functions. This means we can rewrite Eq. 35 in terms of $V(x, w)$
to become exactly the same as Eq. 10. For this reason, we are justified to work with the
TD(λ) weight update on control problems using the ε-greedy policy in the experiments of
this paper.
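The rewriting step can be spelled out as follows. This is a short sketch that assumes the reward $r_t$ does not depend on $w$ and that the sum in Eq. 35 runs over $t \ge 0$. With $Q_t = r_t + V_{t+1}$ from Eq. 2 and $Q'_t = r_t + V'_{t+1}$,

$$\begin{aligned}
\left(\frac{\partial Q}{\partial w}\right)_t &= \left(\frac{\partial V}{\partial w}\right)_{t+1}, \qquad Q'_t - Q_t = V'_{t+1} - V_{t+1}, \\
\Delta w &= \alpha \sum_{t \ge 0} \left(\frac{\partial V}{\partial w}\right)_{t+1} \left( V'_{t+1} - V_{t+1} \right) = \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial w}\right)_t \left( V'_t - V_t \right),
\end{aligned}$$

where the last step only shifts the summation index by one.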



Appendix E. One-dimensional Lunar-Lander Problem 

In this continuous-time problem, a spacecraft is constrained to move in a vertical line and 
its objective is to make a fuel-efficient gentle landing. The spacecraft is released from 
vertically above a landing pad in a uniform gravitational field, and has a single thruster 
that can produce upward accelerations. 

The state vector $x$ has three components: height ($h$), velocity ($v$), and fuel remaining
($u$), so that $x_t = (h_t, v_t, u_t)^T$. Velocity and height have upwards defined to be positive. The
spacecraft can perform upward accelerations $a_t$ with $a_t \in [0, 1]$.

The continuous-time model functions for this problem are:

$$\bar{f}((h, v, u)^T, a) = (v, (a - k_g), -a)^T$$
$$\bar{r}((h, v, u)^T, a) = -(k_f) a + \bar{r}^C(a)$$

$k_g \in (0, 1)$ is a constant giving the acceleration due to gravity; the spacecraft can produce
greater acceleration than that due to gravity. $k_f$ is a constant giving fuel penalty. We used
$k_g = 0.2$ and $k_f = 2$.

Terminal states are where the spacecraft hits the ground ($h = 0$) or runs out of fuel
($u = 0$). In addition to the continuous-time reward $\bar{r}$ defined above, a final impulse of
reward equal to $-v^2 - 2(k_g)h$ is given as soon as the lander reaches a terminal state. The
terms in this final reward represent kinetic and potential energy respectively, which means
when the spacecraft runs out of fuel, it is as if it crashes to the ground by freefall.

$\bar{r}^C(a) = -\int_{0.5}^{a} g^{-1}(x)\,dx$ is the action-cost term of the reward function (as described in
Section 2.2), where $g(x) = \frac{1}{2}(\tanh(x/c) + 1)$ and therefore
$\bar{r}^C(a) = c \left[ x \,\mathrm{arctanh}(1 - 2x) - \frac{1}{2} \ln(1 - x) \right]_{0.5}^{a}$.

This means the continuous-time greedy policy is exactly $a_t = g\left(-k_f + \left(\frac{\partial \bar{f}}{\partial a}\right)_t G_t\right)$ and this
ensures $a_t \in (0, 1)$.
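As a check on the closed form of the action-cost term, the following sketch compares it with a direct numerical integration of $-g^{-1}$, and evaluates the greedy action for an example value-gradient. The function names, the midpoint-rule integrator and the example gradient are assumptions for illustration; $k_f = 2$ is the value quoted above.

```python
import math

K_F = 2.0  # fuel penalty constant from the text

def g(x, c):
    return 0.5 * (math.tanh(x / c) + 1.0)

def g_inv(y, c):
    # Inverse of g on (0, 1): y = 0.5 * (tanh(x / c) + 1)  =>  x = c * atanh(2y - 1).
    return c * math.atanh(2.0 * y - 1.0)

def action_cost_closed(a, c):
    """c * [x * arctanh(1 - 2x) - 0.5 * ln(1 - x)] evaluated between 0.5 and a."""
    def antiderivative(x):
        return c * (x * math.atanh(1.0 - 2.0 * x) - 0.5 * math.log(1.0 - x))
    return antiderivative(a) - antiderivative(0.5)

def action_cost_numeric(a, c, n=20000):
    """Minus the integral of g_inv from 0.5 to a, by the midpoint rule."""
    step = (a - 0.5) / n
    total = sum(g_inv(0.5 + (i + 0.5) * step, c) for i in range(n))
    return -total * step

def greedy_action(G, c):
    """a = g(-k_f + (df/da) . G), where df/da = (0, 1, -1) for f = (v, a - k_g, -a)."""
    G_h, G_v, G_u = G
    return g(-K_F + G_v - G_u, c)

print(action_cost_closed(0.8, 1.0), action_cost_numeric(0.8, 1.0))
print(greedy_action((0.0, 3.0, 0.5), c=1.0))
```

The action cost is zero at $a = 0.5$ and negative elsewhere in $(0, 1)$, so it penalises actions that approach the bounds.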

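The derivatives of these model functions are:

$$\frac{\partial \bar{f}}{\partial a} = (0, 1, -1)^T, \qquad \frac{\partial \bar{r}}{\partial a} = -k_f - g^{-1}(a), \qquad \frac{\partial \bar{r}}{\partial x} = 0,$$

and the only non-zero entry of $\frac{\partial \bar{f}}{\partial x}$ is the derivative of the height component of $\bar{f}$ with respect to $v$, which equals 1.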




E.1 Discontinuity Corrections for Continuous-Time Formulations (Clipping)

With a continuous-time model and episodic task, care needs to be taken in calculating
gradients at any terminal states or points where the model functions are not smooth. Figure
8 illustrates this complication when the spacecraft reaches a terminal state.

Figure 8: The lines AB and CD are sections of two trajectories that approach a transition
to a region of terminal states (the line $h = 0$, in this case). If the end A of AB is
moved down then the end B will move down. However if C moves down then D
will move left, due to the presence of the barrier. This difference is what we call
the problem of Discontinuity Corrections.

This problem means $G'$ changes discontinuously at the boundary of terminal states.
Since the learning algorithms only use $G'_t$ for $t < F$, the last gradient we need is the
limiting one as $t \to F$. This can be calculated by considering the following one-sided limit:



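$$\begin{aligned}
\lim_{\Delta t \to 0^+} \left(\frac{\partial R^\pi}{\partial x}\right)_{F - \Delta t}
&= \lim_{h \to 0^+} \left(\frac{\partial R^\pi}{\partial x}\right)_{F + h/v} && \text{since for small } h,\ \Delta t \approx -h/v \\
&= \lim_{h \to 0^+} \frac{\partial}{\partial x} \left( \left((k_f) a - \bar{r}^C(a)\right) \frac{h}{v} - \left( v - \frac{(a - k_g) h}{v} \right)^2 \right)_{F + h/v} \\
&= \begin{pmatrix} \dfrac{(k_f) a_F - \bar{r}^C(a_F)}{v_F} + 2 (a_F - k_g) \\ -2 v_F \\ 0 \end{pmatrix}
\end{aligned}$$

where $a_F = \lim_{t \to F^-} (a_t)$.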

Similarly it can be shown that $\lim_{t \to F^-} \left(\frac{\partial R^\pi}{\partial a}\right)_t = 0$, and therefore the boundary
condition to use for the target value-gradient is given by

$$\lim_{t \to F^-} G'_t = \begin{pmatrix} \dfrac{(k_f) a_F - \bar{r}^C(a_F)}{v_F} + 2 (a_F - k_g) \\ -2 v_F \\ 0 \end{pmatrix} \tag{36}$$

This limiting target value-gradient is the one to use instead of the boundary condition
given in Eq. 24 or a value-gradient based solely on the final reward impulse. If this issue is
ignored then the first component in the above vector would be zero and learning would not
find optimal trajectories. A similar correction needs making in the case of the spacecraft
running out of fuel.
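The limiting gradient can be checked against finite differences of the near-terminal return. The sketch below assumes constant thrust `a` over the short remaining flight, a negative velocity, and the constants $k_g = 0.2$, $k_f = 2$ with $c = 1$; the function names and the sample state are illustrative.

```python
import math

K_G, K_F, C = 0.2, 2.0, 1.0

def action_cost(a):
    """Closed-form action-cost term, c * [x*arctanh(1 - 2x) - 0.5*ln(1 - x)] from 0.5 to a."""
    def antiderivative(x):
        return C * (x * math.atanh(1.0 - 2.0 * x) - 0.5 * math.log(1.0 - x))
    return antiderivative(a) - antiderivative(0.5)

def near_terminal_return(h, v, a):
    """Remaining return from height h just above the ground, for v < 0 and constant a."""
    dt = -h / v                       # approximate time left before touchdown
    v_final = v + (a - K_G) * dt      # velocity at touchdown
    return (-K_F * a + action_cost(a)) * dt - v_final ** 2

def boundary_gradient(v_f, a_f):
    """Limiting target value-gradient (h, v, u components), as in Eq. 36."""
    return ((K_F * a_f - action_cost(a_f)) / v_f + 2.0 * (a_f - K_G), -2.0 * v_f, 0.0)

v_f, a_f = -0.5, 0.6
h0, d = 1e-5, 1e-6
# Central differences with respect to h and v, taken very close to the ground.
fd_h = (near_terminal_return(h0 + d, v_f, a_f) - near_terminal_return(h0 - d, v_f, a_f)) / (2 * d)
fd_v = (near_terminal_return(h0, v_f + d, a_f) - near_terminal_return(h0, v_f - d, a_f)) / (2 * d)
print(boundary_gradient(v_f, a_f), (fd_h, fd_v))
```

The height component is clearly non-zero here, which illustrates why a boundary condition based solely on the final reward impulse would mislead learning.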

Also, in the calculation of the trajectory (Eq. 20) by an appropriate numerical method,
we think it is best to use clipping in the final time step, so that the spacecraft cannot, for
example, finish with $h < 0$. The use of clipping ensures that the total reward is a smooth
function of the weights and this should aid methods that work by local exploration.

E.2 Lunar-Lander Optimal Trajectories

It is useful to know the optimal trajectories, purely for comparison of the learned solutions. 
Here we only consider optimal trajectories where there is sufficient fuel to land gently. An
optimal policy is found using Pontryagin's maximum principle: The adjoint vector $p(t)$
satisfies the differential equation (DE)

$$\frac{\partial p_t}{\partial t} = -\left(\frac{\partial \bar{r}}{\partial x}\right)_t - \left(\frac{\partial \bar{f}}{\partial x}\right)_t p_t$$

and the trajectory state evolution equation (i.e. the model functions), and where the action
at each instant is found by $a_t = g\left(-k_f + \left(\frac{\partial \bar{f}}{\partial a}\right)_t p_t\right)$. The trajectory has to be evaluated
backwards from the end point at a given $v_F$ and $h = 0$. The boundary condition for this
end point is $p_F = \lim_{t \to F^-} G'_t$ (see Eq. 36) with $a_F = g(-k_f - 2 v_F)$.

Solving the adjoint vector DE, and substituting into the expression for $a_t$ gives

$$a_t = g\left(-k_f - 2 v_F + (F - t)\, p^h_F\right) \tag{37}$$

$$p_t = \begin{pmatrix} p^h_F \\ -2 v_F + (F - t)\, p^h_F \\ 0 \end{pmatrix} \quad \text{with} \quad p^h_F = \frac{(k_f) a_F - \bar{r}^C(a_F)}{v_F} + 2 (a_F - k_g)$$

where $p^h_F$ denotes the height component of $p_F$.

Numerical integration of the model functions with Eq. 37 gives the optimal trajectory
backwards from the given end point. To find which $v_F$ produces the trajectory that passes
through a given point $(h_0, v_0)$ is another problem that requires solving numerically. Some
optimal trajectories found using this method are shown in Figure 9.
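The backward integration can be sketched as follows. This is a minimal illustration using explicit Euler steps backwards in time, not the numerical method actually used for Figure 9; the step size, the duration, the function names and the choice $v_F = -1$ are all assumptions, and the fuel dimension is omitted.

```python
import math

K_G, K_F = 0.2, 2.0

def g(x, c):
    return 0.5 * (math.tanh(x / c) + 1.0)

def action_cost(a, c):
    """Closed-form action-cost term, evaluated between 0.5 and a."""
    def antiderivative(x):
        return c * (x * math.atanh(1.0 - 2.0 * x) - 0.5 * math.log(1.0 - x))
    return antiderivative(a) - antiderivative(0.5)

def optimal_trajectory_backward(v_f, c=1.0, duration=5.0, dt=1e-3):
    """Integrate backwards from the end point (h = 0, v = v_f), with actions from Eq. 37."""
    a_f = g(-K_F - 2.0 * v_f, c)
    p_h = (K_F * a_f - action_cost(a_f, c)) / v_f + 2.0 * (a_f - K_G)
    h, v = 0.0, v_f
    path = [(h, v, a_f)]
    for i in range(1, int(round(duration / dt)) + 1):
        tau = i * dt                                   # tau = F - t, time before touchdown
        a = g(-K_F - 2.0 * v_f + tau * p_h, c)         # Eq. 37
        # Euler step backwards in time: x(t - dt) = x(t) - dt * f(x(t), a).
        h, v = h - dt * v, v - dt * (a - K_G)
        path.append((h, v, a))
    return path

path = optimal_trajectory_backward(v_f=-1.0)
print(path[0], path[-1])
```

Reading the returned list from its last entry to its first gives the trajectory forwards in time, ending at touchdown. Finding the $v_F$ whose trajectory passes through a chosen start point would require an outer search, which is not shown.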



[Figure 9 panels: c = 1 and c = 0.01]

Figure 9: Lunar-Lander optimal trajectories. Shows height (y-axis) vs. velocity (x-axis).
Fuel dimension of state-space is omitted. Trajectories start at diamond and finish
at $h = 0$. As $c \to 0$, trajectories become more fuel-efficient, and the transition
between the freefall phase and braking phase becomes sharper. It is most fuel-efficient
to freefall for as long as possible (shown by the upper curved sections)
and then to brake as quickly as possible (shown by the lower curved sections).

Appendix F. Derivation of actor training equation



The actor training equation (Eq. 27) is non-standard. However it can be seen to be
consistent with the more standard equations, while automatically including exploration, as
follows. Barto et al. (1983) use the TD(0) error signal $\delta_t = (r_t + V_{t+1} - V_t)$ to train the
actor, and also specify that some stochastic behaviour is required to force exploration.

When the domain of actions to choose from is continuous, the simplest technique to
force exploration is to add a small amount of random noise $n_t$ at time step $t$ to the action
chosen (as done by Doya, 2000), giving modified actions $a'_t = a_t + n_t$. The stochastic real-valued
(SRV) unit algorithm (Gullapalli, 1990) is used to train the actor while efficiently
compensating for the added noise:

$$\Delta z = \alpha\, n_t \left(\frac{\partial \pi}{\partial z}\right)_t \left( r'_t + V'_{t+1} - V_t \right)$$

where $r'_t$ and $V'_{t+1}$ are the reward and next value obtained with the modified action $a'_t$.
Making a first order Taylor series approximation to the above equation, by expanding the
terms $V'_{t+1}$ and $r'_t$ about the values they would have had if there was no noise, gives

$$\Delta z \approx \alpha\, n_t \left(\frac{\partial \pi}{\partial z}\right)_t \left( n_t \left( \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} \right) + r_t + V_{t+1} - V_t \right)$$

Integrating with respect to $n_t$ to find the mean weight update, and assuming $n_t \in [-\epsilon, \epsilon]$ is
a uniformly distributed random variable over a small range centred on zero, gives

$$\langle \Delta z \rangle = \frac{1}{2\epsilon} \int_{-\epsilon}^{\epsilon} \Delta z \, dn_t = \alpha \frac{\epsilon^2}{3} \left(\frac{\partial \pi}{\partial z}\right)_t \left(\frac{\partial Q}{\partial a}\right)_t$$

which is equivalent to Eq. 27 when summed over $t$. This justifies the use of Eq. 27 and
explains how it automatically incorporates exploration. ■
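The averaging step can be verified numerically for scalar quantities. The sketch below integrates the linearised update over uniform noise with the trapezoidal rule and compares it with the closed form $\alpha \frac{\epsilon^2}{3} \left(\frac{\partial \pi}{\partial z}\right)_t \left(\frac{\partial Q}{\partial a}\right)_t$. All names and sample values are illustrative assumptions, with `dq_da` standing for $\left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1}$.

```python
def mean_srv_update(dq_da, td_error, eps, alpha, dpi_dz, n=100001):
    """Average of alpha * noise * dpi_dz * (noise * dq_da + td_error) over noise ~ U[-eps, eps]."""
    step = 2.0 * eps / (n - 1)
    total = 0.0
    for i in range(n):
        noise = -eps + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0   # trapezoidal rule end-point weights
        total += weight * alpha * noise * dpi_dz * (noise * dq_da + td_error)
    return total * step / (2.0 * eps)

def closed_form_update(dq_da, eps, alpha, dpi_dz):
    """alpha * eps^2 / 3 * dpi_dz * dq_da; the TD error term averages to zero."""
    return alpha * eps ** 2 / 3.0 * dpi_dz * dq_da

numeric = mean_srv_update(dq_da=0.7, td_error=3.0, eps=0.1, alpha=0.5, dpi_dz=-1.2)
closed = closed_form_update(dq_da=0.7, eps=0.1, alpha=0.5, dpi_dz=-1.2)
print(numeric, closed)
```

The TD error term is odd in the noise and so vanishes on average, leaving only the term proportional to $\left(\frac{\partial Q}{\partial a}\right)_t$.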